We propose a general theorem providing upper bounds for the risk of an
empirical risk minimizer (ERM). We essentially focus on the binary
classification framework. We extend Tsybakov's analysis of the risk of
an ERM under margin type conditions by using concentration inequalities
for conveniently weighted empirical processes. This allows us to deal
with ways of measuring the "size" of a class of classifiers other than
entropy with bracketing as in Tsybakov's work. In particular, we derive
new risk bounds for the ERM when the classification rules belong to some
VC-class under margin conditions and discuss the optimality of these
bounds in a minimax sense.

1. Introduction. The main results of this paper are obtained within the
binary classification framework for which one observes $n$ independent
copies $(X_1,Y_1),\ldots,(X_n,Y_n)$ of a pair $(X,Y)$ of random variables,
where $X$ takes its values in some measurable space $\mathcal{X}$ and the
response variable $Y$ belongs to $\{0,1\}$. Denoting by $P$ the joint
distribution of $(X,Y)$, the quality of a classifier $t$ (i.e., a
measurable mapping $t : \mathcal{X} \to \{0,1\}$) is measured by
$P(Y \neq t(X))$. If the distribution $P$ were known, the problem of
finding an optimal classifier would be easily solved by considering the
Bayes classifier $s^*$ defined for every $x \in \mathcal{X}$ by
$s^*(x) = 1_{\eta(x) \ge 1/2}$, where $\eta(x) = P[Y = 1 \mid X = x]$
denotes the regression function of $Y$ given $X = x$. The Bayes
classifier $s^*$ is indeed known to minimize the probability of
misclassification $P(Y \neq t(X))$ over the collection of all
classifiers. The accuracy of a given classifier $t$ is then measured by
its relative loss with respect to the Bayes classifier,
$\ell(s^*,t) = P(Y \neq t(X)) - P(Y \neq s^*(X))$. The statistical
learning problem consists in designing estimators of $s^*$ based on the
sample $(X_1,Y_1),\ldots,(X_n,Y_n)$ with as low probability of
misclassification as possible. In the sequel we shall use $\ell$ as a
loss function and consider the expected risk
$\mathbb{E}[\ell(s^*,\hat s)]$ to analyze the performance of a given
estimator $\hat s$.
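
The definitions above can be made concrete on a finite space. The following is a minimal sketch, assuming that $\mathcal{X}$ is a finite set, that the marginal distribution of $X$ and the regression function $\eta$ are given as dictionaries, and that ties $\eta(x) = 1/2$ are resolved in favor of the label 1; the function names and the toy distribution are hypothetical and only serve the illustration.

```python
from typing import Dict, Hashable

Point = Hashable


def bayes_classifier(eta: Dict[Point, float]) -> Dict[Point, int]:
    """Bayes rule s*(x) = 1 if eta(x) >= 1/2 and 0 otherwise."""
    return {x: int(p >= 0.5) for x, p in eta.items()}


def misclassification(t: Dict[Point, int],
                      mu: Dict[Point, float],
                      eta: Dict[Point, float]) -> float:
    """P(Y != t(X)) for a classifier t on a finite space.

    Given X = x the label equals 1 with probability eta(x), so predicting 0
    errs with probability eta(x) and predicting 1 errs with 1 - eta(x).
    """
    return sum(mu[x] * (eta[x] if t[x] == 0 else 1.0 - eta[x]) for x in mu)


def relative_loss(t: Dict[Point, int],
                  mu: Dict[Point, float],
                  eta: Dict[Point, float]) -> float:
    """l(s*, t) = P(Y != t(X)) - P(Y != s*(X))."""
    s_star = bayes_classifier(eta)
    return misclassification(t, mu, eta) - misclassification(s_star, mu, eta)


# Hypothetical toy distribution on a two-point space.
mu = {"a": 0.5, "b": 0.5}
eta = {"a": 0.9, "b": 0.2}
always_one = {"a": 1, "b": 1}
print(bayes_classifier(eta))
print(relative_loss(always_one, mu, eta))
```

On this toy example the relative loss of the constant classifier agrees with the integrated form $\mathbb{E}[|2\eta(X) - 1|\,|s^*(X) - t(X)|]$ of the relative loss, since only the point "b" is misclassified relative to the Bayes rule.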

1.1. Empirical risk minimization. Given a class of measurable sets
$\mathcal{A}$, let $S = \{1_A, A \in \mathcal{A}\}$ be the corresponding
class of classifiers. The empirical risk minimization (ERM) principle
consists in taking as an estimator of $s^*$ some minimizer of the
empirical criterion

$\gamma_n(t) = \frac{1}{n}\sum_{i=1}^{n} 1_{Y_i \neq t(X_i)}$.
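
As an illustration of the ERM principle, here is a minimal sketch, assuming a one-dimensional feature and a finite list of candidate threshold classifiers $x \mapsto 1_{x \ge \theta}$; the helper names, the candidate thresholds and the sample are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[Tuple[float, int]]


def empirical_risk(t: Callable[[float], int], sample: Sample) -> float:
    """Fraction of sample points (x, y) with t(x) != y."""
    return sum(1 for x, y in sample if t(x) != y) / len(sample)


def threshold_classifier(theta: float) -> Callable[[float], int]:
    """Indicator of the half-line [theta, +infinity)."""
    return lambda x: int(x >= theta)


def erm_threshold(thresholds: List[float], sample: Sample) -> float:
    """Return a threshold minimizing the empirical risk.

    Ties are broken in favor of the first candidate in the list.
    """
    return min(thresholds,
               key=lambda th: empirical_risk(threshold_classifier(th), sample))


sample = [(0.1, 0), (0.2, 0), (0.4, 1), (0.6, 0), (0.7, 1), (0.9, 1)]
thresholds = [0.0, 0.3, 0.5, 0.65, 1.0]
best = erm_threshold(thresholds, sample)
print(best, empirical_risk(threshold_classifier(best), sample))
```

Several candidates may share the minimal empirical risk (here two thresholds each make a single error), which is why the text speaks of "some minimizer" rather than "the minimizer".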

Choosing a proper model $S$ among a given list in such a way that
simultaneously the bias $\inf_{t \in S} \ell(s^*,t)$ is small enough and
the "size" of $S$ is not too large represents the main challenge of
model selection procedures. Since
the early work of Vapnik and his celebrated book [23], there have been 
many works on this topic and several attempts to improve on the penaliza- 
tion method of the empirical risk (the structural risk minimization) initially 
proposed by Vapnik to select among a list of nested models with finite VC- 
dimensions. Our purpose in this paper is in some sense much less ambitious 
(although the final goal of our analysis is to design new penalization proce- 
dures), and we intend to address the problem of properly identifying what 
is the benchmark of our estimation problem. More precisely, we just want 
here to clarify and provide some answers to the following basic questions 
about the ERM estimators. Assuming first, for the sake of simplicity, that 
there is no bias, that is, that s* belongs to S: 

• What is the order of the expected risk of the ERM on S? 

• Is it minimax and in what sense? 

Of course, since the pioneering work of Vapnik, these questions have been 
addressed by several authors, but, as we shall see, there are some gaps in 
the theory. Our aim is to provide some rather complete and general analysis 
in order to present a unified view allowing a (maybe) better understanding 
of some already existing results and also to complete the theory by proving 
some new results. 

1.2. Known risk bounds. Let us begin with the case where $\mathcal{A}$ is
a VC-class.

1.2.1. Classical bounds for VC-classes. Recall that if
$m_{\mathcal{A}}(N)$ denotes the supremum of
$\#\{A \cap C, A \in \mathcal{A}\}$ over the collection of subsets $C$ of
$\mathcal{X}$ with cardinality $N$, then $\mathcal{A}$ has the
Vapnik-Chervonenkis (VC) property iff
$V = \sup\{N : m_{\mathcal{A}}(N) = 2^N\} < \infty$ and $V$ is called the
VC-dimension of $\mathcal{A}$. If we denote by $\mathcal{P}(S)$ the set of
all joint distributions $P$ such that $s^*$ belongs to $S$ (we must keep
in mind that the regression function $\eta$ as well as the Bayes
classifier $s^*$ depend on $P$), then (under some convenient
measurability condition on $\mathcal{A}$) the following uniform risk
bound is available for the ERM $\hat s$ (see [13], e.g.):

$\sup_{P \in \mathcal{P}(S)} \mathbb{E}[\ell(s^*,\hat s)] \le \kappa_1 \sqrt{V/n}$,

where $\kappa_1$ denotes some absolute constant. Note that the initial
upper bounds on the expected risk for a VC-class found in [23] involved
an extra logarithmic factor because they were based on direct
combinatorial methods on empirical processes. This factor can be removed
(see [13]) by using chaining techniques and the notion of universal
entropy, which was introduced in [10] and [19] independently.
Furthermore, this upper bound is optimal in the minimax sense since,
whenever $2 \le V \le n$, one has (see [5])

$\inf_{\tilde s} \sup_{P \in \mathcal{P}(S)} \mathbb{E}[\ell(s^*,\tilde s)] \ge \kappa_2 \sqrt{V/n}$

for some absolute positive constant $\kappa_2$, where the infimum is
taken over the family of all estimators. Apparently this sounds like the
end of the story, but one should realize that this minimax point of view
is indeed over-pessimistic. As noted by Vapnik and Chervonenkis
themselves in [24], in the (of course, over-optimistic!) situation where
$Y = \eta(X)$ almost surely (in this case $P$ is called a zero-error
distribution), restricting the set of joint distributions to be with
zero-error, the order of magnitude of the minimax lower bound changes
drastically, since then one gets $V/n$ instead of $\sqrt{V/n}$. This
clearly shows that there is some room for improvement of these global
minimax bounds.
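
The combinatorial quantities recalled at the beginning of this subsection can be computed by brute force on small examples. The following is a minimal sketch, assuming that the space is restricted to a finite ground set and that the class of sets is given as a finite list of membership predicates; both restrictions, as well as the function names, are assumptions made for the illustration and are not part of the general definition.

```python
from itertools import combinations
from typing import Callable, List, Sequence

Predicate = Callable[[float], bool]


def trace_count(points: Sequence[float], sets: List[Predicate]) -> int:
    """Number of distinct subsets of `points` of the form A intersected with C."""
    return len({frozenset(p for p in points if member(p)) for member in sets})


def shatter_coefficient(ground: Sequence[float],
                        sets: List[Predicate], N: int) -> int:
    """Supremum of the trace count over subsets C of `ground` with N points."""
    return max(trace_count(c, sets) for c in combinations(ground, N))


def vc_dimension(ground: Sequence[float], sets: List[Predicate]) -> int:
    """Largest N (within the ground set) whose shatter coefficient is 2**N."""
    V = 0
    for N in range(1, len(ground) + 1):
        if shatter_coefficient(ground, sets, N) == 2 ** N:
            V = N
    return V


# Open intervals (l, r) with endpoints on a grid; l == r gives the empty set.
ground = [1, 2, 3, 4]
cuts = [0.5, 1.5, 2.5, 3.5, 4.5]
intervals = [(lambda p, l=l, r=r: l < p < r)
             for l in cuts for r in cuts if l <= r]
print(vc_dimension(ground, intervals))
```

For intervals on the line, two points can be shattered but no three points can (the subset made of the two extreme points is never cut out), which is what the brute-force count reflects.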

1.2.2. Refined bounds for VC-classes. Denoting by $\mu$ the marginal
distribution of $X$ under $P$, if one takes into account the value of

(1)  $L(P) = P(Y \neq s^*(X)) = \mathbb{E}[\eta(X) \wedge (1 - \eta(X))]$,

it is possible to get alternative bounds for the risk which can improve
on the preceding ones provided that $L(P)$ is small enough [$L(P) = 0$
corresponds to the zero-error case]. The risk bounds found in [5] can be
summarized as follows. Given $L_0 \in (0,1/2)$, if one considers the set
$\mathcal{P}_{L_0}(S)$ of distributions $P$ belonging to $\mathcal{P}(S)$
such that $L(P) = L_0$, then (under some measurability condition) for
some absolute constant $\kappa_3$, the following upper bound for the risk
of the ERM on $S$ is available:

(2)  $\sup_{P \in \mathcal{P}_{L_0}(S)} \mathbb{E}[\ell(s^*,\hat s)] \le \kappa_3 \sqrt{\frac{L_0 V (1 + \log(n/V))}{n}}$  if $L_0 \ge \kappa_3 (V/n)$.

Moreover, this result is sharp in the minimax sense (up to some
logarithmic factor) since, for some absolute positive constant
$\kappa_4$, one has

(3)  $\inf_{\tilde s} \sup_{P \in \mathcal{P}_{L_0}(S)} \mathbb{E}[\ell(s^*,\tilde s)] \ge \kappa_4 \sqrt{\frac{L_0 V}{n}}$  if $\kappa_4 L_0 (1 - 2L_0)^2 \ge V/n$.

We see that (possibly omitting some logarithmic factor) the minimax risk
can be of order $V/n$ whenever $L_0$ is of order $V/n$ and that the above
bounds offer some kind of interpolation between the zero-error case and
the distribution-free situations.

However, a careful analysis of the proof of the lower bound (3) shows
that the worst distributions are those for which the regression function
$\eta$ is allowed to be arbitrarily close to 1/2. This tends to indicate
that maybe some analysis taking into account the way $\eta$ behaves
around 1/2 could be sharper than the preceding one.

1.2.3. Faster rates under margin conditions. In [22], Tsybakov attracted
attention to rates faster than $1/\sqrt{n}$ that can be achieved by the
ERM estimator under a "margin" type condition which is of a different
nature from the Devroye and Lugosi condition above, as we shall see
below. This condition was first introduced by Mammen and Tsybakov (see
[14]) in the related context of discriminant analysis and can be stated
as

(4)  $\ell(s^*,t) \ge h^\theta \|s^* - t\|_1^\theta$  for every $t \in S$,

where $\|\cdot\|_1$ denotes the $L_1(\mu)$-norm, $h$ is some positive
constant [that we can assume to be smaller than 1 since we can always
change $h$ into $h \wedge 1$ without violating (4)] and $\theta \ge 1$.
Since $\ell(s^*,t) = \mathbb{E}_\mu[|2\eta(X) - 1|\,|s^*(X) - t(X)|]$, we
readily see that condition (4) is closely related to the behavior of
$\eta(X)$ around 1/2. In particular, we shall often use in this paper the
easily interpretable condition

(5)  $|2\eta(x) - 1| \ge h$  for every $x \in \mathcal{X}$,

which of course implies (4) with $\theta = 1$. Tsybakov uses entropy with
bracketing conditions (rather than the VC-condition). In [22], it is
shown that, denoting by $H_{[\cdot]}(\varepsilon, S, \mu)$ the
$L_1(\mu)$-entropy with bracketing of $S$ (defined as the logarithm of
the minimal number of brackets $[f,g]$ with $\|f - g\|_1 \le \varepsilon$
which are necessary to cover $S$), if
$H_{[\cdot]}(\varepsilon, S, \mu) \lesssim \varepsilon^{-r}$ for some
positive number $r < 1$, then an ERM estimator $\hat s$ over $S$
satisfies $\mathbb{E}[\ell(s^*,\hat s)] = O(n^{-\theta/(2\theta + r - 1)})$.
Hence, Tsybakov's result shows that there is a variety of rates
$n^{-\alpha}$ with $1/2 < \alpha < 1$ which can be achieved by an ERM
estimator.
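
To see how the exponent of the rate moves with the margin and entropy parameters, one can simply tabulate it. The sketch below assumes the exponent $\alpha = \theta/(2\theta + r - 1)$ with $\theta \ge 1$ and $0 < r < 1$ as recalled above; the function name is hypothetical.

```python
def tsybakov_rate_exponent(theta: float, r: float) -> float:
    """Exponent alpha of the rate n**(-alpha), alpha = theta / (2*theta + r - 1)."""
    if theta < 1.0:
        raise ValueError("the margin exponent theta is assumed to be >= 1")
    if not 0.0 < r < 1.0:
        raise ValueError("the entropy exponent r is assumed to lie in (0, 1)")
    return theta / (2.0 * theta + r - 1.0)


for theta in (1.0, 2.0, 10.0):
    for r in (0.1, 0.5, 0.9):
        print(theta, r, round(tsybakov_rate_exponent(theta, r), 4))
```

The exponent equals $1/(1+r)$ when $\theta = 1$ and decreases toward $1/2$ as $\theta$ grows, so the fastest rates correspond to the strongest margin condition and the smallest classes.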

1.3. Presentation of our results. Our purpose is twofold:

• Providing a general nonasymptotic upper bound for the risk of an ERM
which allows us to recover Tsybakov's results for classes with integrable
entropy with bracketing and also to derive new bounds for VC-classes
under margin conditions.

• Focusing on the margin condition (5) and considering $h$ as a free
parameter (which may perfectly depend on $n$, for instance), we shall
prove minimax lower bounds showing how sharp the preceding upper bounds
are.

Even if our upper bounds will cover general margin type conditions [like
Tsybakov's condition (4) or even more general than that], we like the
idea of focusing on the simpler easy-to-interpret condition (5), which
allows comparisons with previous approaches in VC-theory like the one
developed in [5]. Let us now state some of the results that we prove in
this paper. In order to take into account the margin condition (5)
within a minimax approach, we introduce, for every $h \in [0,1]$, the set
$\mathcal{P}(h,S)$ of probability distributions $P$ satisfying the
conditions

(6)  $|2\eta(x) - 1| \ge h$  for all $x \in \mathcal{X}$  and  $s^* \in S$

(one should keep in mind that $\eta$ as well as $s^*$ depends on $P$,
which gives a sense to the definition above). The case $h = 0$
corresponds to the global minimax approach [one has
$\mathcal{P}(0,S) = \mathcal{P}(S)$], while $h = 1$ corresponds to the
zero-error case.

1.3.1. The VC-case. We assume that $\mathcal{A}$ has finite VC-dimension
$V \ge 1$ and we consider an empirical risk minimizer $\hat s$ over $S$.
Then (at least under some appropriate measurability assumption on the
VC-class) we shall prove that, for some absolute positive constant
$\kappa$, either

$\sup_{P \in \mathcal{P}(h,S)} \mathbb{E}[\ell(s^*,\hat s)] \le \kappa \sqrt{V/n}$  if $h \le \sqrt{V/n}$

or

(7)  $\sup_{P \in \mathcal{P}(h,S)} \mathbb{E}[\ell(s^*,\hat s)] \le \kappa \frac{V}{nh}\left(1 + \log\left(\frac{nh^2}{V}\right)\right)$  if $h > \sqrt{V/n}$.

It turns out that, apart from a possible logarithmic factor, this upper
bound is optimal in the minimax sense. We indeed show that there exists
some absolute positive constant $\kappa'$ such that if $2 \le V \le n$,

(8)  $\inf_{\tilde s} \sup_{P \in \mathcal{P}(h,S)} \mathbb{E}[\ell(s^*,\tilde s)] \ge \kappa' \left[\sqrt{V/n} \wedge \frac{V}{nh}\right]$.

These upper and lower bounds coincide up to the logarithmic factor
$1 + \log(nh^2/V)$ and offer some continuous interpolation between the
"global" minimax pessimistic bound of order $\sqrt{V/n}$ corresponding to
the situation where $h = 0$ (or $h < \sqrt{V/n}$ because then with the
margin parameter $h$ being too small, the margin condition has no effect
on the order of the minimax risk) and the zero-error case $h = 1$ for
which the minimax risk is of order $V/n$ (up to some logarithmic factor).
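
The interpolation described above can be checked numerically. The following is a minimal sketch, assuming that the upper rate is $\sqrt{V/n}$ for $h \le \sqrt{V/n}$ and $(V/(nh))(1 + \log(nh^2/V))$ otherwise, that the lower rate is $\sqrt{V/n} \wedge V/(nh)$, and that all absolute constants are set to 1; the function names are hypothetical.

```python
import math


def margin_upper_rate(V: int, n: int, h: float) -> float:
    """Order of the upper bound on the ERM risk (absolute constant omitted)."""
    threshold = math.sqrt(V / n)
    if h <= threshold:
        return threshold
    return V / (n * h) * (1.0 + math.log(n * h * h / V))


def margin_lower_rate(V: int, n: int, h: float) -> float:
    """Order of the minimax lower bound (absolute constant omitted)."""
    if h == 0.0:
        return math.sqrt(V / n)
    return min(math.sqrt(V / n), V / (n * h))


V, n = 10, 10_000
for h in (0.0, 0.01, 0.1, 0.5, 1.0):
    up, low = margin_upper_rate(V, n, h), margin_lower_rate(V, n, h)
    print(h, up, low, up / low)
```

For margins below $\sqrt{V/n}$ the two orders coincide, and above it their ratio is exactly the logarithmic factor $1 + \log(nh^2/V)$.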

In order to compare our bounds with those of Devroye and Lugosi [5]
recalled above, it is interesting to consider the simple situation where
$\eta$ takes on only two values: $(1-h)/2$ and $(1+h)/2$. Then by (1),
$L(P) = L_0 = (1-h)/2$ so that if, for instance, $h = 1/2$, then
$L_0 = 1/4$ and the upper bound given by (7) is of the order of the
square of the upper bound given by (2). In other words, (2) can be of the
same order as in the zero-error case only if $L(P)$ is close enough to
zero, while our upper bound is of the same order as in the zero-error
case as soon as the margin parameter stays away from 0 (and not only when
it is close to 1), which occurs even if $L(P)$ does not tend to zero as
$n$ goes to infinity as shown in the preceding elementary example.

We shall also discuss the necessity of the logarithmic factor
$1 + \log(nh^2/V)$ in (7). We shall see that the presence of this factor
depends on something other than the VC-property. In other words, for
some VC-classes, this factor can be removed from the upper bound, while,
for some others (which are rich enough in a sense that we shall make
precise in Section 3), the minimax lower bound can be refined in order to
make this logarithmic factor appear. Quite interestingly, this is, in
particular, the case when $\mathcal{A}$ is the class of half-spaces in
$\mathbb{R}^d$. We shall indeed prove that in this case, whenever
$2 \le d$, one has, for some positive constant $\kappa''$,

(9)  $\inf_{\tilde s} \sup_{P \in \mathcal{P}(h,S)} \mathbb{E}[\ell(s^*,\tilde s)] \ge \kappa''(1-h)\frac{d}{nh}\left(1 + \log\left(\frac{nh^2}{d}\right)\right)$.

We do not know if the factor $1 - h$ in this lower bound can be removed
or not but, apart from this factor and up to some absolute positive
constant, we can conclude from our study that the minimax risk under the
margin condition with parameter $h$ over the class of half-spaces is
indeed of order $(d/nh)(1 + \log(nh^2/d))$, provided that
$h \ge \sqrt{d/n}$.

1.3.2. The entropy with bracketing case. We assume that the entropy with
bracketing of $S$ satisfies

(10)  $H_{[\cdot]}(\varepsilon, S, \mu) \le K_1 \varepsilon^{-r}$  for every $\varepsilon \in (0,1)$

for some positive number $r < 1$. We can analyze what is the influence on
the risk of an ERM $\hat s$ of the margin condition (5) by introducing
the set $\mathcal{P}(h,S,\mu)$ of distributions $P$ belonging to
$\mathcal{P}(h,S)$ with prescribed first marginal distribution $\mu$.
Then, for some constant $C_1$ depending only on $K_1$ and $r$, we have

$\sup_{P \in \mathcal{P}(h,S,\mu)} \mathbb{E}[\ell(s^*,\hat s)] \le C_1\left((nh^{1-r})^{-1/(r+1)} \wedge n^{-1/2}\right)$.

Moreover, this bound is optimal in the minimax sense, at least if the
entropy with bracketing and the $L_1(\mu)$ metric entropy are of the same
order. More precisely, recall that the $L_1(\mu)$ metric entropy of $S$,
denoted by $H_1(\varepsilon,S,\mu)$, is defined as the logarithm of the
maximal number of functions $t_1,\ldots,t_N$ belonging to $S$ such that
$\|t_i - t_j\|_1 > \varepsilon$ for every $i \neq j$. If (10) holds and
if, furthermore, for some positive number $\varepsilon_0 < 1$, one has,
for some positive constant $K_2$,

(11)  $H_1(\varepsilon,S,\mu) \ge K_2 \varepsilon^{-r}$  for every $\varepsilon \in (0,\varepsilon_0]$,

then, for some positive constant $C_2$ depending on $K_1$, $K_2$,
$\varepsilon_0$ and $r$, one has

$\inf_{\tilde s} \sup_{P \in \mathcal{P}(h,S,\mu)} \mathbb{E}[\ell(s^*,\tilde s)] \ge C_2(1-h)^{1/(r+1)}\left((nh^{1-r})^{-1/(r+1)} \wedge n^{-1/2}\right)$.

In [11, 14] or [6], one can find some explicit examples of classes of
subsets of $\mathbb{R}^d$ with smooth boundaries which satisfy both (10)
and (11) when $\mu$ is equivalent to the Lebesgue measure on the unit
cube.

The paper is organized as follows. In Section 2 we give a general theo- 
rem which provides an upper bound for the risk of an ERM via the tech- 
niques based on concentration inequalities for weighted empirical processes 
which were introduced in [15]. The nature of the weight that we are using
is absolutely crucial because this is exactly what makes the difference at 
the end of the day between our upper bounds for VC-classes and those of 
Devroye and Lugosi [5] which also derive from the analysis of a weighted 
empirical process but with a different weight. This theorem can be applied 
to the classification framework, providing the new results described above, 
but in fact it can also be applied to other frameworks, such as regression 
with bounded errors. Section 3 is devoted to the minimax lower bounds un- 
der margin conditions, while the proofs of all our results are given in Section 
4. We have finally postponed to Section 4.2.3 the statements of essentially 
well known maximal inequalities for empirical processes that we have used 
all along in the paper. 

2. A general upper bound for empirical risk minimizers. In this section
we intend to analyze the behavior of empirical risk minimizers within a
framework which is more general than binary classification. Suppose that
one observes independent variables $\xi_1,\ldots,\xi_n$ taking their
values in some measurable space $\mathcal{Z}$ with common distribution
$P$. The two main frameworks that we have in mind are classification and
bounded regression. In these cases, for every $i$, the variable
$\xi_i = (X_i,Y_i)$ is a copy of a pair of random variables $(X,Y)$,
where $X$ takes its values in some measurable space $\mathcal{X}$ and $Y$
is assumed to take its values in $[0,1]$. In the classification case, the
response variable $Y$ is assumed to belong to $\{0,1\}$. One defines the
regression function $\eta$ as $\eta(x) = \mathbb{E}[Y \mid X = x]$ for
every $x \in \mathcal{X}$. In the regression case, one is interested in
the estimation of $\eta$, while in the classification case one wants to
estimate the Bayes classifier $s^*$, defined for every
$x \in \mathcal{X}$ by $s^*(x) = 1_{\eta(x) \ge 1/2}$. One of the most
commonly used methods to estimate the regression function $\eta$ or the
Bayes classifier $s^*$ or, more generally, to estimate a quantity of
interest $s$ depending on the unknown distribution $P$, is the so-called
empirical risk minimization (according to Vapnik's terminology in [23]).
It can be considered a special instance of minimum contrast estimation,
which is of course a widely used method in statistics, maximum
likelihood estimation being the most celebrated example.

2.1. Empirical risk minimization. Basically one considers some set
$\mathcal{S}$ which is known to contain $s$. Think of $\mathcal{S}$ as
being the set of all measurable functions from $\mathcal{X}$ to $[0,1]$
in the regression case or to $\{0,1\}$ in the classification case. Then
we consider some loss (or contrast) function

(12)  $\gamma$ from $\mathcal{S} \times \mathcal{Z}$ to $[0,1]$,

which is well adapted to our problem of estimating $s$ in the sense that
the expected loss $P[\gamma(t,\cdot)]$ achieves a minimum at the point
$s$ when $t$ varies in $\mathcal{S}$. In other words, the relative
expected loss $\ell$ defined by

(13)  $\ell(s,t) = P[\gamma(t,\cdot) - \gamma(s,\cdot)]$  for all $t \in \mathcal{S}$

is nonnegative. In the regression or the classification case, one can
take $\gamma(t,(x,y)) = (y - t(x))^2$ since $\eta$ (resp. $s^*$) is
indeed the minimizer of $\mathbb{E}[(Y - t(X))^2]$ over the set of
measurable functions $t$ taking their values in $[0,1]$ (resp.
$\{0,1\}$). The heuristics of empirical risk minimization (or minimum
contrast estimation) can be described as follows. If one substitutes the
empirical loss

(14)  $\gamma_n(t) = P_n[\gamma(t,\cdot)] = \frac{1}{n}\sum_{i=1}^{n}\gamma(t,\xi_i)$

for its expectation $P[\gamma(t,\cdot)]$ and minimizes $\gamma_n$ on some
subset $S$ of $\mathcal{S}$ (that we call a model), there is some hope to
get a sensible estimator $\hat s$ of $s$, at least if $s$ belongs (or is
close enough) to the model $S$. This estimation method is widely used and
has been extensively studied in the asymptotic parametric setting for
which one assumes that $S$ is a given parametric model, $s$ belongs to
$S$ and $n$ is large.

The purpose of this section is to provide a general nonasymptotic upper
bound for the relative expected loss between $\hat s$ and $s$.

We introduce the centered empirical process $\bar\gamma_n$ defined by

(15)  $\bar\gamma_n(t) = \gamma_n(t) - P[\gamma(t,\cdot)]$.

In addition to the relative expected loss function $\ell$, we shall need
another way of measuring the closeness between the elements of
$\mathcal{S}$ which is directly connected to the variance of the
increments of $\gamma_n$ and therefore will play an important role in the
analysis of the fluctuations of $\bar\gamma_n$. Let $d$ be some
pseudo-distance on $\mathcal{S} \times \mathcal{S}$ (which may perfectly
depend on the unknown distribution $P$) such that

(16)  $\mathrm{Var}_P[\gamma(t,\cdot) - \gamma(s,\cdot)] \le d^2(s,t)$  for every $t \in \mathcal{S}$.

Of course, we can take $d$ as the pseudo-distance associated with the
variance of $\gamma$ itself, but it will be more convenient in
applications to take $d$ as a more intrinsic distance. For instance, in
the regression or the classification setting it is easy to see that $d$
can be chosen (up to some constant) as the $L_2(\mu)$ distance, where we
recall that $\mu$ denotes the distribution of $X$. Indeed, for
classification,

$|\gamma(t,(x,y)) - \gamma(s^*,(x,y))| = |1_{y \neq t(x)} - 1_{y \neq s^*(x)}| \le |t(x) - s^*(x)|$

and, therefore,

$\mathrm{Var}_P[\gamma(t,\cdot) - \gamma(s^*,\cdot)] \le d^2(s^*,t)$  with  $d^2(s^*,t) = \mathbb{E}_\mu[(t(X) - s^*(X))^2]$,

while, for regression,

$[\gamma(t,(x,y)) - \gamma(\eta,(x,y))]^2 = [t(x) - \eta(x)]^2[2(y - \eta(x)) - t(x) + \eta(x)]^2$.

Since $\mathbb{E}_P[Y - \eta(X) \mid X] = 0$ and
$\mathbb{E}_P[(Y - \eta(X))^2 \mid X] \le 1/4$, we derive that

$\mathbb{E}_P[[2(Y - \eta(X)) - t(X) + \eta(X)]^2 \mid X] = 4\mathbb{E}_P[(Y - \eta(X))^2 \mid X] + (-t(X) + \eta(X))^2 \le 2$,

and therefore,

(17)  $\mathbb{E}_P[\gamma(t,(X,Y)) - \gamma(\eta,(X,Y))]^2 \le 2\mathbb{E}_\mu(t(X) - \eta(X))^2$.

Our main result below will crucially depend on two different moduli of
uniform continuity: the stochastic modulus of uniform continuity of
$\bar\gamma_n$ over $S$ with respect to $d$ and the modulus of uniform
continuity of $d$ with respect to $\ell$.

The main tool that we shall use is Talagrand's inequality for empirical
processes (see [21]) which will allow us to control the oscillations of
the empirical process $\bar\gamma_n$ by the modulus of uniform continuity
of $\bar\gamma_n$ in expectation. More precisely, we shall use the
following version of it due to Bousquet [4] which has the advantage of
providing explicit constants and of dealing with one-sided suprema. If
$\mathcal{F}$ is a countable family of measurable functions such that,
for some positive constants $v$ and $b$, one has, for every
$f \in \mathcal{F}$, $P(f^2) \le v$ and $\|f\|_\infty \le b$, then, for
every positive $y$, the following inequality holds for
$Z = \sup_{f \in \mathcal{F}}(P_n - P)(f)$:

(18)  $\mathbb{P}\left[Z - \mathbb{E}[Z] \ge \sqrt{\frac{2(v + 4b\mathbb{E}[Z])y}{n}} + \frac{2by}{3n}\right] \le e^{-y}$.

Unlike McDiarmid's inequality (see [18]) which has been widely used in 
statistical learning theory (see [13]), a concentration inequality like (18) 
offers the possibility of controlling the empirical process locally. Applying 
this inequality to some conveniently weighted empirical process will be the 
key step of the proof of Theorem 2 below. 

2.2. The main theorem. We need to specify some mild regularity condi- 
tions that we shall assume to be verified by the moduli of continuity involved 
in our result. 

Definition 1. We denote by $\mathcal{C}_1$ the class of nondecreasing and
continuous functions $\psi$ from $\mathbb{R}_+$ to $\mathbb{R}_+$ such
that $x \to \psi(x)/x$ is nonincreasing on $(0,+\infty)$ and
$\psi(1) \ge 1$.

Note that if $\psi$ is a nondecreasing continuous and concave function on
$\mathbb{R}_+$ with $\psi(0) = 0$ and $\psi(1) \ge 1$, then $\psi$
belongs to $\mathcal{C}_1$. In particular, for the applications that we
shall study below, an example of special interest is
$\psi(x) = Ax^\alpha$, where $\alpha \in (0,1]$ and $A \ge 1$.

In order to avoid measurability problems and to use the concentration 
inequality above, we need to consider some separability condition on S. The 
following one will be convenient. 

(M) There exists some countable subset $S'$ of $S$ such that, for every
$t \in S$, there exists some sequence $(t_k)$ of elements of $S'$ such
that, for every $\xi \in \mathcal{Z}$, $\gamma(t_k,\xi)$ tends to
$\gamma(t,\xi)$ as $k$ tends to infinity.

We are now in position to state our upper bound for the relative expected
loss of any empirical risk minimizer on some given model $S$. This bound
will depend on the bias term $\ell(s,S) = \inf_{t \in S}\ell(s,t)$ and on
the fluctuations of the empirical process $\bar\gamma_n$ on $S$. As a
matter of fact, we shall consider some slightly more general estimators.
Namely, given some nonnegative number $\rho$, we consider some
$\rho$-empirical risk minimizer, that is, any estimator $\hat s$ taking
its values in $S$ such that
$\gamma_n(\hat s) \le \rho + \inf_{t \in S}\gamma_n(t)$.

Theorem 2. Let $\gamma$ be a loss function satisfying (12) such that $s$
minimizes $P(\gamma(t,\cdot))$ when $t$ varies in $\mathcal{S}$. Let
$\ell$, $\gamma_n$ and $\bar\gamma_n$ be defined by (13), (14) and (15)
and consider a pseudo-distance $d$ on $\mathcal{S} \times \mathcal{S}$
satisfying (16). Let $\phi$ and $w$ belong to the class of functions
$\mathcal{C}_1$ defined above and let $S$ be a subset of $\mathcal{S}$
satisfying the separability condition (M). Assume that, on the one hand,

(19)  $d(s,t) \le w(\sqrt{\ell(s,t)})$  for every $t \in S$,

and that, on the other hand, one has, for every $u \in S'$,

(20)  $\sqrt{n}\,\mathbb{E}\left[\sup_{t \in S',\, d(u,t) \le \sigma}[\bar\gamma_n(u) - \bar\gamma_n(t)]\right] \le \phi(\sigma)$

for every positive $\sigma$ such that $\phi(\sigma) \le \sqrt{n}\sigma^2$,
where $S'$ is given by assumption (M). Let $\varepsilon_*$ be the unique
positive solution of the equation

(21)  $\sqrt{n}\varepsilon_*^2 = \phi(w(\varepsilon_*))$.

Then there exists an absolute constant $\kappa$ such that, for every
$y \ge 1$, the following inequality holds:

(22)  $\mathbb{P}[\ell(s,\hat s) > 2\rho + 2\ell(s,S) + \kappa y \varepsilon_*^2] \le e^{-y}$.

In particular, the following risk bound is available:

$\mathbb{E}[\ell(s,\hat s)] \le 2(\rho + \ell(s,S) + \kappa\varepsilon_*^2)$.
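
The fixed point $\varepsilon_*$ of equation (21) rarely needs a closed form, since it can be located by bisection. The sketch below assumes that $\phi$ and $w$ are supplied as Python callables belonging to the class $\mathcal{C}_1$ of Definition 1: under that assumption $\varepsilon \mapsto \phi(w(\varepsilon))/\varepsilon$ is nonincreasing while $\varepsilon \mapsto \sqrt{n}\varepsilon$ is increasing, so the root is unique and lies above $1/(2\sqrt{n})$. The function name and the two example moduli are hypothetical.

```python
import math
from typing import Callable


def solve_eps_star(phi: Callable[[float], float],
                   w: Callable[[float], float],
                   n: int, tol: float = 1e-12) -> float:
    """Bisection for the positive root of sqrt(n) * eps**2 = phi(w(eps)).

    Assumes phi and w are nondecreasing, with phi(x)/x and w(x)/x
    nonincreasing and phi(1) >= 1, w(1) >= 1 (the class C_1).
    """
    sqrt_n = math.sqrt(n)

    def g(eps: float) -> float:
        return sqrt_n * eps * eps - phi(w(eps))

    lo = 0.5 / sqrt_n          # g(lo) < 0 under the C_1 assumptions
    hi = 1.0 / sqrt_n
    while g(hi) < 0.0:         # enlarge the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Linear modulus phi(sigma) = A * sigma with a margin-type w(eps) = eps / sqrt(h).
n, h, A = 10_000, 0.25, 2.0
eps = solve_eps_star(lambda s: A * s, lambda e: e / math.sqrt(h), n)
print(eps, eps ** 2, A ** 2 / (n * h))
```

With a linear $\phi$ and $w(\varepsilon) = \varepsilon/\sqrt{h}$ the equation can be solved by hand, giving $\varepsilon_*^2 = A^2/(nh)$, which the numerical root reproduces.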

Remarks. Let us first give some comments about Theorem 2:

• The absolute constant 2 appearing in (22) has no magic meaning here.
It could be replaced by any $C > 1$ at the price of making the constant
$\kappa$ depend on $C$.

• One can wonder if an empirical risk minimizer over $S$ exists. Note
that condition (M) implies that, for every positive $\rho$, there exists
some measurable choice of a $\rho$-empirical risk minimizer since then
$\inf_{t \in S'}\gamma_n(t) = \inf_{t \in S}\gamma_n(t)$. If
$\rho = 1/n$, for instance, it is clear that, according to (22), such an
estimator performs as well as a strict empirical risk minimizer.

• For the computation of $\phi$ satisfying (20), since the supremum
appearing in the left-hand side of (20) is extended to the countable set
$S'$ and not $S$ itself, it will allow us to restrict ourselves to the
case where $S$ is countable.

• It is worth mentioning that, assuming for simplicity that $s \in S$,
(22) still holds if we consider the empirical loss
$\gamma_n(s) - \gamma_n(\hat s)$ instead of the expected loss
$\ell(s,\hat s)$. This is indeed a by-product of the proof of Theorem 2
to be found in Section 4.

Even if the main motivation for Theorem 2 is the study of classification,
it can also be easily applied to bounded regression. We begin the
illustration of Theorem 2 within this framework, which is more elementary
than classification since in this case there is a clear connection
between the expected loss and the variance of the increments.

2.3. Application to bounded regression. In this setting, the regression
function $\eta : x \to \mathbb{E}_P[Y \mid X = x]$ is the target to be
estimated, so that here $s = \eta$. We recall that for this framework we
can take $d$ to be the $L_2(\mu)$ distance times $\sqrt{2}$. The
connection between the loss function $\ell$ and $d$ is especially simple
in this case. Indeed,
$[\gamma(t,(x,y)) - \gamma(\eta,(x,y))] = [-t(x) + \eta(x)][2(y - \eta(x)) - t(x) + \eta(x)]$,
so that $\mathbb{E}_P[Y - \eta(X) \mid X] = 0$ implies that

$\ell(\eta,t) = \mathbb{E}_P[\gamma(t,(X,Y)) - \gamma(\eta,(X,Y))] = \mathbb{E}_\mu(t(X) - \eta(X))^2$.

Hence, $2\ell(\eta,t) = d^2(\eta,t)$ and in this case the modulus of
continuity $w$ can simply be taken as $w(\varepsilon) = \sqrt{2}\,\varepsilon$.
The quadratic risk of an empirical risk minimizer over some model $S$
depends only on the modulus of continuity $\phi$ satisfying (20) and one
derives from Theorem 2 that, for some absolute constant $\kappa'$,
$\mathbb{E}[d^2(\eta,\hat s)] \le 2d^2(\eta,S) + \kappa'\varepsilon_*^2$,
where $\varepsilon_*$ is the solution of
$\sqrt{n}\varepsilon_*^2 = \phi(\varepsilon_*)$. To be more concrete, let
us give an example where this modulus $\phi$ and the bias term
$d^2(\eta,S)$ can be evaluated, leading to an upper bound for the minimax
risk over some classes of regression functions.

2.3.1. Binary images. Following Korostelev and Tsybakov [11], our purpose
is to study the particular regression framework for which the variables
$X_i$'s are uniformly distributed on $[0,1]^2$ and
$\eta(x) = \mathbb{E}_P[Y \mid X = x]$ is of the form
$\eta(x_1,x_2) = b$ if $x_2 \le \partial\eta(x_1)$ and $a$ otherwise,
where $\partial\eta$ is some measurable map from $[0,1]$ to $[0,1]$ and
$0 \le a < b \le 1$. The function $\partial\eta$ should be understood as
the parametrization of a boundary fragment corresponding to some portion
$\eta$ of a binary image in the plane ($a$ and $b$ representing the two
levels of color which are taken by the image), and restoring this portion
of the image from the noisy data $(X_1,Y_1),\ldots,(X_n,Y_n)$ means
estimating $\eta$ or, equivalently, $\partial\eta$. Let $\mathcal{G}$ be
the set of measurable maps from $[0,1]$ to $[0,1]$. For any
$f \in \mathcal{G}$, let us denote by $\chi_f$ the function defined on
$[0,1]^2$ by $\chi_f(x_1,x_2) = b$ if $x_2 \le f(x_1)$ and $a$ otherwise.
From this definition, we see that $\chi_{\partial\eta} = \eta$ and, more
generally, if we define $\mathcal{S} = \{\chi_f : f \in \mathcal{G}\}$,
for every $t \in \mathcal{S}$, we denote by $\partial t$ the element of
$\mathcal{G}$ such that $\chi_{\partial t} = t$. It is natural to
consider here as an approximate model for $\eta$ a model $S$ of the form
$S = \{\chi_f : f \in \partial S\}$, where $\partial S$ denotes some
subset of $\mathcal{G}$. Denoting by $\|\cdot\|_1$ (resp. $\|\cdot\|_2$)
the Lebesgue $L_1$-norm (resp. $L_2$-norm), one has, for every
$f, g \in \mathcal{G}$,

$\|\chi_f - \chi_g\|_1 = (b-a)\|f - g\|_1$  and  $\|\chi_f - \chi_g\|_2^2 = (b-a)^2\|f - g\|_1$

or, equivalently, for every $s, t \in \mathcal{S}$,

$\|s - t\|_1 = (b-a)\|\partial s - \partial t\|_1$  and  $\|s - t\|_2^2 = (b-a)^2\|\partial s - \partial t\|_1$.

Given $u = \chi_g \in S$, we have to compute some function $\phi$
satisfying (20) and therefore to majorize $\mathbb{E}[W(\sigma)]$, where
$W(\sigma) = \sup_{t \in S,\, d(u,t) \le \sigma}[\bar\gamma_n(u) - \bar\gamma_n(t)]$.
This can be done using entropy with bracketing arguments. Indeed, let us
notice that if $f - \delta \le f' \le f + \delta$, then, defining
$f_L = \sup(f - \delta, 0)$ and $f_U = \inf(f + \delta, 1)$, the
following inequalities hold: $\chi_{f_L} \le \chi_{f'} \le \chi_{f_U}$
and $\|\chi_{f_L} - \chi_{f_U}\|_1 \le 2(b-a)\delta$. This means that,
setting $S_\sigma = \{t \in S, d(t,u) \le \sigma\}$,
$\partial S_\rho = \{f \in \partial S, \|f - g\|_1 \le \rho\}$ and
defining $H_\infty(\delta,\rho)$ as the $L_\infty$ metric entropy for
radius $\delta$ of $\partial S_\rho$, one has, for every positive
$\varepsilon$,

$H_{[\cdot]}(\varepsilon, S_\sigma, \mu) \le H_\infty\left(\frac{\varepsilon}{2(b-a)}, \frac{\sigma^2}{2(b-a)^2}\right)$.

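Moreover, if $[t_L,t_U]$ is a bracket with extremities in $S$ and
$L_1(\mu)$ diameter not larger than $\delta$ and if $t \in [t_L,t_U]$,
then

$y^2 - 2t_U(x)y + t_L^2(x) \le (y - t(x))^2 \le y^2 - 2t_L(x)y + t_U^2(x)$,

which implies that $\gamma(t,\cdot)$ belongs to a bracket with
$L_1(P)$-diameter not larger than

$2\mathbb{E}_P\left[(t_U(X) - t_L(X))\left(Y + \frac{t_U(X) + t_L(X)}{2}\right)\right] \le 2\delta$.

Hence, if $\mathcal{F} = \{\gamma(t,\cdot), t \in S \text{ and } d(t,u) \le \sigma\}$, then

$H_{[\cdot]}(x, \mathcal{F}, P) \le H_\infty\left(\frac{x}{4(b-a)}, \frac{\sigma^2}{2(b-a)^2}\right)$

and furthermore, if $d(t,u) \le \sigma$,

$\mathbb{E}[|(Y - t(X))^2 - (Y - u(X))^2|] \le 2\|u - t\|_1 = \frac{2\|u - t\|_2^2}{b-a} \le \frac{\sigma^2}{b-a}$.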



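Setting

$\varphi(\sigma) = \int_0^{\sigma/\sqrt{b-a}} \left(H_\infty\left(\frac{x^2}{4(b-a)}, \frac{\sigma^2}{2(b-a)^2}\right)\right)^{1/2} dx$,

we derive from Lemma A.4 that
$\sqrt{n}\,\mathbb{E}[W(\sigma)] \le 12\varphi(\sigma)$, provided that

(23)  $4\varphi(\sigma) \le \sqrt{n}\,\frac{\sigma^2}{b-a}$.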

The point now is that, whenever $\partial S$ is part of a linear
finite-dimensional subspace of $L_\infty[0,1]$, $H_\infty(\delta,\rho)$
is typically bounded by $D[B + \log(\rho/\delta)]$ for some appropriate
constants $D$ and $B$. If it is so, then

$\varphi(\sigma) \le \sqrt{D}\int_0^{\sigma/\sqrt{b-a}} \sqrt{B + \log\frac{2\sigma^2}{(b-a)x^2}}\,dx = \sigma\sqrt{\frac{D}{b-a}}\int_0^1 \sqrt{B + \log 2 + 2|\log\delta|}\,d\delta$,

which implies that, for some absolute constant $\kappa$,
$\varphi(\sigma) \le \kappa\sigma\sqrt{(1+B)D/(b-a)}$. The constraint
(23) is a fortiori satisfied if
$\sigma\sqrt{b-a} \ge 4\kappa\sqrt{(1+B)D/n}$. Hence, if we take
$\phi(\sigma) = 12\kappa\sigma\sqrt{(1+B)D/(b-a)}$, assumption (20) is
satisfied. To be more concrete, let us consider the example where
$\partial S$ is taken to be the set of piecewise constant functions on a
regular partition with $D$ pieces on $[0,1]$ with values in $[0,1]$.
Then, it is shown in [1] that
$H_\infty(\delta,\rho) \le D[\log(\rho/\delta)]$ and, therefore, the
preceding analysis can be used with $B = 0$. As a matter of fact, this
extends to piecewise polynomials with degree not larger than $r$ via some
adequate choice of $B$ as a function of $r$, but we just consider the
histogram case here to be simple. As a conclusion, Theorem 2 yields in
this case for the empirical risk minimizer $\hat s$ over $S$

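$\mathbb{E}[\|\partial\eta - \partial\hat s\|_1] \le 2\inf_{t \in S}\|\partial\eta - \partial t\|_1 + C\frac{D}{(b-a)^3 n}$

for some absolute constant $C$. In particular, if $\partial\eta$
satisfies the Hölder smoothness condition
$|\partial\eta(x) - \partial\eta(x')| \le L|x - x'|^\alpha$ with $L > 0$
and $\alpha \in (0,1]$, then
$\inf_{t \in S}\|\partial\eta - \partial t\|_1 \le LD^{-\alpha}$, leading
to

$\mathbb{E}[\|\partial\eta - \partial\hat s\|_1] \le 2LD^{-\alpha} + C\frac{D}{(b-a)^3 n}$.

Hence, if $\mathcal{H}(L,\alpha)$ denotes the set of functions from
$[0,1]$ to $[0,1]$ satisfying the Hölder condition above, an adequate
choice of $D$ yields, for some constant $C'$ depending only on $a$ and
$b$,

$\sup_{\partial\eta \in \mathcal{H}(L,\alpha)} \mathbb{E}[\|\partial\eta - \partial\hat s\|_1] \le C'\left(L \vee \frac{1}{n}\right)^{1/(\alpha+1)} n^{-\alpha/(1+\alpha)}$.

As a matter of fact, this upper bound is unimprovable (up to constants)
from a minimax point of view (see [11] for the corresponding minimax
lower bound).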
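
The "adequate choice of $D$" is a bias-variance trade-off that can be explored numerically. The sketch below assumes a bound of the shape $2LD^{-\alpha} + cD/n$, where the hypothetical constant `c_var` stands for the absolute constant together with the dependence on $a$ and $b$; the function names are also hypothetical.

```python
def risk_bound(D: int, n: int, L: float, alpha: float,
               c_var: float = 1.0) -> float:
    """Bias term 2 * L * D**(-alpha) plus variance term c_var * D / n."""
    return 2.0 * L * D ** (-alpha) + c_var * D / n


def best_partition_size(n: int, L: float, alpha: float,
                        c_var: float = 1.0) -> int:
    """Number of pieces D in {1, ..., n} minimizing the bound above."""
    return min(range(1, n + 1),
               key=lambda D: risk_bound(D, n, L, alpha, c_var))


n, L, alpha = 10_000, 1.0, 1.0
D_star = best_partition_size(n, L, alpha)
print(D_star, risk_bound(D_star, n, L, alpha), (L * n) ** (1.0 / (1.0 + alpha)))
```

Up to constants the minimizer is of order $(Ln)^{1/(1+\alpha)}$, and plugging it back into the bound gives the order $L^{1/(1+\alpha)} n^{-\alpha/(1+\alpha)}$.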



2.4. Application to classification. Our purpose is to apply our main theorem
to the classification setting, assuming that the Bayes classifier is the
target to be estimated, so that here $s=s^*$. We recall that for this framework
we can take $d$ to be the $L_2(\mu)$-distance (which is also the square root
of the $L_1(\mu)$-distance since we are dealing with $\{0,1\}$-valued functions) and
$S=\{1_A,A\in\mathcal{A}\}$, where $\mathcal{A}$ is some class of measurable sets. Our main task
is to compute the moduli of continuity $\phi$ and $w$. In order to evaluate $w$,
we need some margin type condition. For instance, we can use Tsybakov's
margin condition (4) so that we can also write

(24) $\ell(s,t)\ge h^\theta d^{2\theta}(s,t)$ for every $t\in S$.

As quoted in [22], this condition is satisfied if the distribution of $\eta(X)$
is well behaved around 1/2. If condition (5) holds, then one simply has
$\ell(s,t)=\mathbb{E}[|2\eta(X)-1||s(X)-t(X)|]\ge hd^2(s,t)$, which means that Tsybakov's
condition (24) is satisfied with $\theta=1$. Of course, condition (24) implies
that the modulus of continuity $w$ can be taken as

(25) $w(\varepsilon)=h^{-1/2}\varepsilon^{1/\theta}$.

According to the remark following Theorem 2, we shall first assume $S$
to be countable, knowing that our conclusions will remain valid if $S$ is
just assumed to satisfy the separability condition (M). In order to evaluate
$\phi$, we shall consider two different kinds of assumptions on $S$ which
are well known to imply the Donsker property for the class of functions
$\{\gamma(t,\cdot),t\in S\}$ and therefore the existence of a modulus $\phi$ which tends to
0 at 0, namely, a Vapnik-Chervonenkis (VC) condition or an entropy with
bracketing assumption. Given $u\in S$, in order to bound the expectation of
$W(\sigma)=\sup_{t\in S,\,d(u,t)\le\sigma}(-\bar\gamma_n(t)+\bar\gamma_n(u))$, we shall use the maximal inequalities
for empirical processes which are established in the Appendix via slightly
different techniques according to the way the "size" of the class $\mathcal{A}$ is measured.

2.4.1. The VC-case. We begin with the celebrated VC-condition, which
ensures that $\{\gamma(t,\cdot),t\in S\}$ has the Donsker property whatever $P$. So let us
assume that $\mathcal{A}$ is a VC-class. One has at least two ways of measuring the
"size" of the class $\mathcal{A}$ (or, equivalently, of the class of classifiers $S=\{1_A,A\in\mathcal{A}\}$):

• The random combinatorial entropy defined as

$$H_{\mathcal{A}}=\log\#\{A\cap\{X_1,\dots,X_n\},A\in\mathcal{A}\},$$

which is related to the VC-dimension $V$ of $\mathcal{A}$ via Sauer's lemma (see [13],
e.g.) which ensures that

$$H_{\mathcal{A}}\le V(1+\log(n/V))$$

whenever $n\ge V$.

• The Koltchinskii-Pollard notion of universal metric entropy defined as follows.
For any probability measure $Q$ and every positive $\varepsilon$, let $H_2(\varepsilon,S,Q)$
denote the logarithm of the maximal number of functions $t_1,\dots,t_N$ belonging
to $S$, such that $\mathbb{E}_Q(t_i-t_j)^2>\varepsilon^2$ for every $i\ne j$, and define the
universal metric entropy as

(26) $H_{\mathrm{univ}}(\varepsilon,S)=\sup_QH_2(\varepsilon,S,Q)$,

where the supremum is extended to the set of all probability measures on
$\mathcal{X}$. The universal metric entropy is related to the VC-dimension via Haussler's
bound $H_{\mathrm{univ}}(\varepsilon,\mathcal{A})\le\kappa V(1+\log(\varepsilon^{-1}\vee1))$, where $\kappa$ denotes some
absolute positive constant (see [8]).
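
To make the combinatorial entropy concrete, here is a minimal sketch that counts by brute force the traces of a simple VC-class on a finite sample and compares the result with Sauer's bound. The choice of class (closed intervals of the real line, whose VC-dimension is 2) and all function names are assumptions made for illustration; they do not come from the paper.

```python
import math


def interval_traces(points):
    """All distinct subsets of `points` cut out by closed intervals [a, b]."""
    pts = sorted(set(points))
    traces = {frozenset()}  # an interval that misses every sample point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            # an interval hitting pts[i] and pts[j] contains everything in between
            traces.add(frozenset(pts[i:j + 1]))
    return traces


def combinatorial_entropy(points):
    """H_A = log of the number of traces of the class on the sample."""
    return math.log(len(interval_traces(points)))


def sauer_bound(n, V):
    """Sauer's bound V * (1 + log(n / V)), valid for n >= V."""
    return V * (1.0 + math.log(n / V))


if __name__ == "__main__":
    sample = [0.1, 0.4, 0.5, 0.7, 0.8, 0.95]
    print(combinatorial_entropy(sample), sauer_bound(len(sample), 2))
```

For $n$ distinct points the intervals produce $n(n+1)/2+1$ traces, so the entropy grows like $2\log n$, in line with the bound for $V=2$.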

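The way of expressing $\phi$ by using either the random combinatorial entropy
or the universal metric entropy is detailed in the Appendix. Precisely, to use
the maximal inequalities proved in the Appendix, we introduce the classes
of sets

$$\mathcal{A}_+=\{\{(x,y):1_{y\ne t(x)}<1_{y\ne u(x)}\},t\in S\}$$

and

$$\mathcal{A}_-=\{\{(x,y):1_{y\ne t(x)}>1_{y\ne u(x)}\},t\in S\}.$$

Then we define, for every class of sets $\mathcal{B}$ of $\mathcal{X}\times\{0,1\}$,

$$W^+_{\mathcal{B}}(\sigma)=\sup_{B\in\mathcal{B},\,P(B)\le\sigma^2}(P_n-P)(B)\quad\text{and}\quad W^-_{\mathcal{B}}(\sigma)=\sup_{B\in\mathcal{B},\,P(B)\le\sigma^2}(P-P_n)(B).$$

Then

(27) $\mathbb{E}[W(\sigma)]\le\mathbb{E}[W^+_{\mathcal{A}_+}(\sigma)]+\mathbb{E}[W^-_{\mathcal{A}_-}(\sigma)]$

and it remains to control $\mathbb{E}[W^+_{\mathcal{A}_+}(\sigma)]$ and $\mathbb{E}[W^-_{\mathcal{A}_-}(\sigma)]$ via Lemma A.3, which
is based either on some direct random combinatorial entropy approach or
on some chaining argument and Haussler's bound on the universal entropy
recalled above.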

More precisely, since the VC-dimensions of $\mathcal{A}_+$ and $\mathcal{A}_-$ are not larger
than that of $\mathcal{A}$, and since, similarly, the combinatorial entropies of $\mathcal{A}_+$ and
$\mathcal{A}_-$ are not larger than the combinatorial entropy of $\mathcal{A}$, denoting by $V$ the
VC-dimension of $\mathcal{A}$ (assuming that $V\ge1$), we derive from (27) and Lemma
A.3 that $\sqrt n\,\mathbb{E}[W(\sigma)]\le\phi(\sigma)$, provided that $\phi(\sigma)\le\sqrt n\sigma^2$, where $\phi$ can be
taken either as

(28) $\phi(\sigma)=K\sigma\sqrt{1\vee\mathbb{E}[H_{\mathcal{A}}]}$

or as

(29) $\phi(\sigma)=K\sigma\sqrt{V(1+\log(\sigma^{-1}\vee1))}$.
In both cases, assumption (20) is satisfied and we can apply Theorem 2
with $w\equiv1$ or $w$ defined by (25). When $\phi$ is given by (28), the solution $\varepsilon_*$ of
equation (21) can be explicitly computed when $w$ is given by (25) or $w\equiv1$.
Hence, the conclusion of Theorem 2 holds with

$$\varepsilon_*^2=\Bigl(\frac{K^2(1\vee\mathbb{E}[H_{\mathcal{A}}])}{nh}\Bigr)^{\theta/(2\theta-1)}\wedge\sqrt{\frac{K^2(1\vee\mathbb{E}[H_{\mathcal{A}}])}{n}}.$$

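In the second case, that is, when $\phi$ is given by (29), $w\equiv1$ implies by (21)
that $\varepsilon_*^2=K\sqrt{V/n}$, while if $w(\varepsilon_*)=h^{-1/2}\varepsilon_*^{1/\theta}$, then

(30) $\varepsilon_*^2=K\varepsilon_*^{1/\theta}\sqrt{\dfrac{V}{nh}}\sqrt{1+\log((\sqrt h\,\varepsilon_*^{-1/\theta})\vee1)}$.

Since $1+\log((\sqrt h\,\varepsilon_*^{-1/\theta})\vee1)\ge1$ and $K\ge1$, we derive from (30) that

$$\varepsilon_*^2\ge\Bigl(\frac{V}{nh}\Bigr)^{\theta/(2\theta-1)}.$$

Plugging this inequality in the logarithmic factor of (30) yields

$$\varepsilon_*^2\le K\varepsilon_*^{1/\theta}\sqrt{\frac{V}{nh}}\sqrt{1+\frac{1}{2(2\theta-1)}\log\Bigl(\frac{nh^{2\theta}}{V}\vee1\Bigr)}$$

and, therefore, since $\theta\ge1$, $\varepsilon_*^2\le K\varepsilon_*^{1/\theta}\sqrt{V/(nh)}\sqrt{1+\log((nh^{2\theta}/V)\vee1)}$. Hence,

$$\varepsilon_*^2\le\Bigl(\frac{K^2V(1+\log((nh^{2\theta}/V)\vee1))}{nh}\Bigr)^{\theta/(2\theta-1)}\le K^2\Bigl(\frac{V(1+\log((nh^{2\theta}/V)\vee1))}{nh}\Bigr)^{\theta/(2\theta-1)}$$

and, therefore, the conclusion of Theorem 2 holds with

$$\varepsilon_*^2=K^2\Bigl[\Bigl(\frac{V(1+\log((nh^{2\theta}/V)\vee1))}{nh}\Bigr)^{\theta/(2\theta-1)}\wedge\sqrt{\frac Vn}\Bigr].$$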
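
The passage from the implicit fixed-point equation to an explicit bound can be checked numerically. The following sketch solves $\sqrt n\,\varepsilon^2=\phi(w(\varepsilon))$ by bisection, with $\phi$ as in (29) and $w(\varepsilon)=h^{-1/2}\varepsilon^{1/\theta}$, and compares $\varepsilon_*^2$ with the lower and upper closed forms derived above. The function names, the value $K=1$ and the sample parameters are assumptions chosen for illustration only.

```python
import math


def phi_of_w(eps, n, V, h, theta, K=1.0):
    """phi(w(eps)) / sqrt(n), for phi as in (29) and w(eps) = eps**(1/theta) / sqrt(h)."""
    w = eps ** (1.0 / theta) / math.sqrt(h)
    log_term = 1.0 + math.log(max(1.0 / w, 1.0))
    return K * w * math.sqrt(V * log_term) / math.sqrt(n)


def solve_eps_sq(n, V, h, theta, K=1.0, iters=200):
    """Bisection (on a log scale) for the unique eps with eps**2 = phi(w(eps))/sqrt(n)."""
    lo, hi = 1e-12, 1e6
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if mid * mid < phi_of_w(mid, n, V, h, theta, K):
            lo = mid  # the root lies above mid
        else:
            hi = mid
    return lo * hi  # square of the geometric midpoint


def closed_form_bounds(n, V, h, theta, K=1.0):
    """Lower and upper bounds on eps_*^2 obtained in the text (K >= 1, theta >= 1)."""
    expo = theta / (2.0 * theta - 1.0)
    lower = (V / (n * h)) ** expo
    log_term = 1.0 + math.log(max(n * h ** (2.0 * theta) / V, 1.0))
    upper = K ** 2 * (V * log_term / (n * h)) ** expo
    return lower, upper


if __name__ == "__main__":
    for theta in (1.0, 2.0):
        eps_sq = solve_eps_sq(10_000, 10, 0.5, theta)
        print(theta, eps_sq, closed_form_bounds(10_000, 10, 0.5, theta))
```

The bisection is legitimate because, after dividing by $\varepsilon^{1/\theta}$, the left-hand side is increasing in $\varepsilon$ while the logarithmic factor is nonincreasing, so the equation has a single root.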



We have a fortiori obtained the following result for the ERM on $S=\{1_A,A\in\mathcal{A}\}$.

Corollary 3. Assume that $S$ satisfies (M) and that $\mathcal{A}$ is a VC-class
with dimension $V\ge1$. There exists an absolute constant $C$ such that if
$\hat s$ denotes an empirical risk minimizer over $S$ and if $s^*$ belongs to $S$, the
following inequality holds:

(32) $\mathbb{E}[\ell(s^*,\hat s)]\le C\sqrt{\dfrac{V\wedge(1\vee\mathbb{E}[H_{\mathcal{A}}])}{n}}$.

Moreover, if $\theta\ge1$ is given and one assumes that the margin condition (24)
holds with $h\ge(V/n)^{1/(2\theta)}$, then the following inequalities are also available:

(33) $\mathbb{E}[\ell(s^*,\hat s)]\le C\Bigl(\dfrac{1\vee\mathbb{E}[H_{\mathcal{A}}]}{nh}\Bigr)^{\theta/(2\theta-1)}$

and

(34) $\mathbb{E}[\ell(s^*,\hat s)]\le C\Bigl(\dfrac{V(1+\log(nh^{2\theta}/V))}{nh}\Bigr)^{\theta/(2\theta-1)}$.

Let us comment on these results:

• The risk bound (32) is well known. Our purpose here was just to show
how it can be derived from our approach.

• The risk bounds (33) and (34) are new and they perfectly fit with (32)
when one considers the borderline case $h=(V/n)^{1/(2\theta)}$. They look very
similar but are not strictly comparable since, roughly speaking, they differ
by a logarithmic factor. Indeed, it may happen that $\mathbb{E}[H_{\mathcal{A}}]$ turns out to be
of the order of $V$ (without any extra log factor). This is the case when $\mathcal{A}$ is
the family of all subsets of a given finite set with cardinality $V$. In such a
case, $\mathbb{E}[H_{\mathcal{A}}]\le V$ and (33) is sharper than (34). On the contrary, for some
arbitrary VC-class, if one uses Sauer's bound on $H_{\mathcal{A}}$, that is, $H_{\mathcal{A}}\le V(1+\log(n/V))$,
the log-factor $1+\log(n/V)$ is larger than $1+\log(nh^{2\theta}/V)$ and
turns out to be too large when $h$ is close to the borderline value $(V/n)^{1/(2\theta)}$.

• For the sake of simplicity, we have assumed $s^*$ to belong to $S$ in the above
statement. Of course, this assumption is not necessary (since our main
theorem does not require it). The price to pay if $s^*$ does not belong to $S$
is simply to add $2\ell(s^*,S)$ to the right-hand side of the risk bounds above.

In the next section we shall discuss the optimality of (34) from a minimax
point of view in the case where $\theta=1$, showing that it is essentially
unimprovable in that sense.



2.4.2. Bracketing conditions. The $L_1(\mu)$ entropy with bracketing of $S$
is denoted by $H_{[\cdot]}(\delta,S,\mu)$ and has been defined in Section 1. The point
is that, setting $\mathcal{F}=\{\gamma(\cdot,t),t\in S\text{ with }d(u,t)\le\sigma\}$, one has $H_{[\cdot]}(\delta,\mathcal{F},P)\le H_{[\cdot]}(\delta,S,\mu)$.
Hence, since we may assume $S$ to be countable (according to
the remark after Theorem 2), we derive from (27) and Lemma A.4 in the
Appendix that, setting $\varphi(\sigma)=\int_0^\sigma H^{1/2}_{[\cdot]}(x^2,S,\mu)\,dx$, the following inequality
is available: $\sqrt n\,\mathbb{E}[W(\sigma)]\le12\varphi(\sigma)$, provided that $4\varphi(\sigma)\le\sigma^2\sqrt n$. Hence, we
can apply Theorem 2 with $\phi=12\varphi$, and if we assume Tsybakov's margin
condition (24) to be satisfied, then we can also take $w(\varepsilon)=(h^{-1/2}\varepsilon^{1/\theta})\wedge1$
according to (25) and derive that the conclusions of Theorem 2 hold with
$\varepsilon_*$ the solution of the equation $\sqrt n\,\varepsilon_*^2=\phi((h^{-1/2}\varepsilon_*^{1/\theta})\wedge1)$. Moreover, if we
assume that condition (10) holds for the entropy with bracketing, then, for
some constant $C'$ depending on the constant $K_1$ coming from (10), one has

(35) $\varepsilon_*^2\le C'[((1-r)^2nh^{1-r})^{-\theta/(2\theta-1+r)}\wedge(1-r)^{-1}n^{-1/2}]$.

Of course, this conclusion still holds if $S$ is no longer assumed to be countable
but fulfills (M). We can alternatively take $T$ to be some $\delta_n$-net [with respect
to the $L_2(\mu)$-distance $d$] of a bigger class $S$ to which the target $s^*$ is assumed
to belong. We can still apply Theorem 2 to the empirical risk minimizer
over $T$, and since $H_{[\cdot]}(x,T,\mu)\le H_{[\cdot]}(x,S,\mu)$, we still get the conclusions of
Theorem 2 with $\varepsilon_*$ satisfying (35) and $\ell(s^*,T)\le\delta_n^2$. This means that if $\delta_n$ is
conveniently chosen (in a way that $\delta_n$ is of lower order as compared to $\varepsilon_*$),
for instance, $\delta_n^2=n^{-1/(1+r)}$, then, for some constant $C''$ depending only on
$K_1$, one has

(36) $\mathbb{E}[\ell(s^*,\hat s)]\le C''[((1-r)^2nh^{1-r})^{-\theta/(2\theta-1+r)}\wedge(1-r)^{-1}n^{-1/2}]$.

This means that we have recovered Tsybakov's Theorem 1 in [22] (as a
matter of fact, our result is slightly more precise since it also provides the
dependence of the risk bound with respect to the margin parameter $h$ and
not only on $\theta$ as in Tsybakov's theorem). We refer to [14] for concrete examples
of classes of sets with smooth boundaries satisfying (10) when $\mu$ is
equivalent to the Lebesgue measure on some compact set of $\mathbb{R}^d$.

3. Minimax lower bounds for classification under margin conditions. We
still consider the binary classification framework for which one observes $n$
i.i.d. copies $(X_1,Y_1),\dots,(X_n,Y_n)$ of a pair of random variables $(X,Y)\in\mathcal{X}\times\{0,1\}$.
The aim is to estimate the Bayes classifier $s^*$. The natural loss
function to be considered is

$$\ell(s^*,t)=P(Y\ne t(X))-P(Y\ne s^*(X))\ge0$$

for any classifier $t:\mathcal{X}\to\{0,1\}$. Our purpose here is to establish lower bounds
matching with the upper bounds for the risk of an empirical risk minimizer
provided in the preceding section. In particular, we wish to take into account
the effect of the margin condition which has been already analyzed
for the upper bounds. Toward this aim, we shall use the minimax point of
view, but under a convenient margin restriction on the distribution $P$ of the
pair $(X,Y)$. Namely, we shall assume that $P$ belongs to the collection of
distributions $\mathcal{P}(h,S)$ as defined by (6). If $\mathcal{A}$ denotes the class of sets linked
to $S$, that is, $S=\{1_A,A\in\mathcal{A}\}$, such as for the upper bounds, the way of
measuring the size of $\mathcal{A}$ will influence the construction of the lower bounds
for the minimax risk

(37) $R_n(h,S)=\inf_{\hat s\in S}\sup_{P\in\mathcal{P}(h,S)}\mathbb{E}[\ell(s^*,\hat s)]$,

where the infimum is taken over the set of all estimators based on the $n$-sample
$(X_1,Y_1),\dots,(X_n,Y_n)$ taking their values in $S$. We begin with the
case where $\mathcal{A}$ is a VC-class.

3.1. VC-classes. We assume $\mathcal{A}$ to be a VC-class of subsets of $\mathcal{X}$ with
VC-dimension $V\ge1$. Some lower bounds for $R_n(h,S)$ are well known in the
two extreme cases $h=0$ and $h=1$.

If $h=0$, we are in a (pessimistic) global minimax approach for which
there is no margin restriction in fact and the following lower bound can be
found in [5]:

(38)

for every $n\ge5(V-1)$.

If $h=1$, we are in the zero-error case for which $Y=s^*(X)$ and we have
at our disposal a lower bound proved by Vapnik and Chervonenkis in [24]
(see also [9]),

(39) $R_n(1,S)\ge\dfrac{V-1}{4en}$

for every $n\ge2\vee(V-1)$.

As expected, the order of these lower bounds for the minimax risk is very
sensitive to the set of joint distributions over which the supremum is taken.
Our purpose is to provide a continuous link between the general case $h=0$
and the zero-error case $h=1$. We first prove a lower bound which holds for
any VC-class $\mathcal{A}$ and then discuss the presence or not of an extra logarithmic
factor in the lower bound for some particular examples. As nicely described
in [26], there exist several techniques to derive minimax lower bounds in
statistics. We shall use two of them below which are based either on Hellinger
distance or Kullback-Leibler information computations.

3.1.1. A general lower bound. Let $\mathcal{A}$ be some class of measurable subsets
of $\mathcal{X}$ and $S$ be the set of classifiers $S=\{1_A,A\in\mathcal{A}\}$. When $\mathcal{A}$ is an arbitrary
VC-class, our lower bound for the minimax risk on $S$ under a margin
condition will be obtained via the "Assouad cube" device which involves
Hellinger distance computations.

Theorem 4. Given $h\in[0,1]$, we consider the minimax risk $R_n(h,S)$
over the set of distributions $\mathcal{P}(h,S)$ as defined in (6) and (37). There exists
an absolute positive constant $\kappa$ such that, if $\mathcal{A}$ is a VC-class with dimension
$V\ge2$, one has

(40) $R_n(h,S)\ge\kappa\Bigl[\sqrt{\dfrac Vn}\wedge\dfrac{V}{nh}\Bigr]$

if $n\ge V$.



The proof of this result will be given in Section 4.2.1. The novelty of this
lower bound emerges when $h\ge\sqrt{V/n}$ since then we see that there is indeed
an effect of the margin condition as compared to the global bound (38). In
particular, for $h=1$, we recover (39), up to some absolute constant.

Let us now discuss the sharpness of this lower bound by comparing it
to the upper bounds derived in the preceding section. Let us consider the
very simple example where $\mathcal{A}$ is the collection of all subsets of a given set
with cardinality $V$. Then of course $\mathcal{A}$ is a VC-class with dimension $V$, and
inequalities (32) and (33) in Corollary 3 ensure that, for some absolute
constant $\kappa'$,

$$R_n(h,S)\le\kappa'\Bigl[\sqrt{\frac Vn}\wedge\frac{V}{nh}\Bigr].$$

Since (at least if $V\ge2$) this upper bound coincides with the preceding
lower bound up to some absolute constant, this shows that these bounds
provide the right order for the minimax risk and therefore cannot be further
improved in this case. However, there exist "richer" VC-classes than this
one for which the logarithmic factor appearing in (34) is in some sense
necessary. The purpose of the next section is precisely to provide a new
combinatorial condition (satisfied by some but not all VC-classes) under
which an extra logarithmic factor must appear in the minimax risk.

3.1.2. A refined lower bound for "rich" VC-classes. Our purpose is to
propose an alternative lower bound for $R_n(h,S)$ when $\mathcal{A}$ is rich enough in a
combinatorial sense that we are going to make explicit. Given some integers
$D$ and $N$, we introduce the following combinatorial property for the class of
sets $\mathcal{A}$:

$(A_{N,D})$ There exist points $x_1,x_2,\dots,x_N$ of $\mathcal{X}$ such that the trace of $\mathcal{A}$
on $x=\{x_1,x_2,\dots,x_N\}$ defined by

$$\mathrm{Tr}(x)=\{A\cap\{x_1,x_2,\dots,x_N\}:A\in\mathcal{A}\}$$

contains all the subsets of $\{x_1,x_2,\dots,x_N\}$ with cardinality $D$.

By definition, if $\mathcal{A}$ is a VC-class with dimension $V$, then $\mathcal{A}$ satisfies $(A_{V,D})$
for all $1\le D\le V$. It is also clear that, given $1\le D\le V$, the VC-class which
was analyzed at the end of the preceding section does not satisfy $(A_{N,D})$ as
soon as $N>V$. On the contrary, we shall see below that there are some nontrivial
examples of VC-classes which satisfy $(A_{N,D})$ for arbitrarily large values
of $N$ and suitable values of $D$. A convenient information theoretic lemma
and combinatorial arguments lead to the following refinement of Theorem 4
that we shall apply to these types of VC-classes. Recall that $S$ denotes the
class of classifiers associated with $\mathcal{A}$, that is, $S=\{1_A,A\in\mathcal{A}\}$.
Theorem 5. Given $D\ge1$, assume that $\mathcal{A}$ satisfies $(A_{N,D})$ for every
integer $N$ such that $N\ge4D$. Given $h\in[0,1)$, we consider the minimax risk
$R_n(h,S)$ over the set of distributions $\mathcal{P}(h,S)$ as defined in (6) and (37).
Then, there exists an absolute positive constant $c$ such that

(41) $R_n(h,S)\ge c(1-h)\dfrac{D}{nh}\Bigl[1+\log\Bigl(\dfrac{nh^2}{D}\Bigr)\Bigr]$,

provided that $h\ge\sqrt{D/n}$.




The proof will be given in Section 4.2.2. We intend now to give two explicit
examples of VC-classes for which we can apply the preceding lower bound.
Given $D\ge1$, assume $\mathcal{X}$ to be some infinite and countable set and let $\mathcal{A}$
be the collection of all subsets with cardinality $D$ of $\mathcal{X}$. Then $\mathcal{A}$ is a VC-class
with dimension $D$ which obviously satisfies property $(A_{N,D})$ for every
integer $N\ge D$. Hence,

$$R_n(h,S)\ge c(1-h)\frac{D}{nh}\Bigl[1+\log\Bigl(\frac{nh^2}{D}\Bigr)\Bigr]$$

provided that $h\ge\sqrt{D/n}$, and if we compare this lower bound with the upper
bound (34), we see that they involve exactly the same logarithmic factor and
that they differ by an absolute multiplicative constant times $1-h$. Thus,
apart from this factor $1-h$, the order of the minimax risk has been identified.
As a matter of fact, we do not know how to get rid of this nuisance factor
$1-h$.

This first example could appear to be rather artificial. More interestingly,
our result also applies to half-spaces in $\mathbb{R}^d$, for $d\ge2$. Indeed, a very nice
combinatorial geometric result to be found in [7] says that, for every integer
$N\ge d+1$, there exist distinct points $x_1,x_2,\dots,x_N$ of $\mathbb{R}^d$ such that the
trace of the collection of half-spaces on $\{x_1,x_2,\dots,x_N\}$ contains all the
subsets of $\{x_1,x_2,\dots,x_N\}$ with cardinality $k\le[d/2]$. This means that the
class $\mathcal{A}$ of half-spaces a fortiori satisfies $(A_{N,[d/2]})$ for every integer $N>d$.
Hence, Theorem 5 applies with $D=[d/2]$ and we get

$$R_n(h,S)\ge\frac c3(1-h)\frac{d}{nh}\Bigl[1+\log\Bigl(\frac{nh^2}{d}\Bigr)\Bigr]$$

provided that $h\ge\sqrt{d/n}$. Furthermore, the VC-dimension of $\mathcal{A}$ is known to
be equal to $d+1$ so that we readily see, as in the preceding example, that the
upper bound which derives from (34) coincides with the above lower bound,
apart from an absolute constant and possibly the nuisance factor $1-h$.

The conclusion of the preceding analysis is that the extra logarithmic
factor appearing in the upper bound (34) cannot be avoided in general.
3.2. A lower bound under some purely metric condition. Our purpose
is now to provide a rather general lower bound under some purely metric
assumption on $S$ instead of the VC-property. Let $\mathcal{P}(h,S,\mu)$ be the set of distributions
$P$ belonging to $\mathcal{P}(h,S)$ with prescribed first marginal distribution
$\mu$ and $R_n(h,S,\mu)$ be the corresponding minimax risk

$$R_n(h,S,\mu)=\inf_{\hat s\in S}\sup_{P\in\mathcal{P}(h,S,\mu)}\mathbb{E}[\ell(s^*,\hat s)].$$

The following general result is available.

Theorem 6. Let $\mu$ be a probability measure on $\mathcal{X}$ and $S$ be some class
of classifiers on $\mathcal{X}$ such that, for some positive constants $K_1$, $K_2$, $\varepsilon_0$ and $r$,

$$K_2\varepsilon^{-r}\le H_1(\varepsilon,S,\mu)\le K_1\varepsilon^{-r}$$

for all $0<\varepsilon\le\varepsilon_0$, where $H_1(\cdot,S,\mu)$ denotes the $L_1(\mu)$-metric entropy of $S$.
Then, there exists a positive constant $K$ depending on $K_1$, $K_2$, $\varepsilon_0$ and $r$
such that the following lower bound holds:

(42) $R_n(h,S,\mu)\ge K(1-h)^{1/(1+r)}[(h^{-(1-r)/(1+r)}n^{-1/(1+r)})\wedge n^{-1/2}]$

whenever $n\ge2$.

The proof of this result will be given in Section 4.2.3. If we are in a situation
where the $L_1(\mu)$ metric entropy and the $L_1(\mu)$ entropy with bracketing
are of the same order, we can compare this lower bound with the upper
bound (36). More precisely, let us assume that, for some positive constants
$K_1$, $K_2$, $\varepsilon_0$ and $r<1$, one has

(43) $K_2\varepsilon^{-r}\le H_1(\varepsilon,S,\mu)\le H_{[\cdot]}(\varepsilon,S,\mu)\le K_1\varepsilon^{-r}$

for every $\varepsilon\le\varepsilon_0$. Then up to a constant (depending on $K_1$, $K_2$, $\varepsilon_0$ and $r$)
and the $(1-h)^{1/(r+1)}$ factor, we see that the lower bound (42) and the
upper bound (36) coincide. Note that (43) is, in particular, satisfied when
$\mathcal{A}$ is a collection of sets with smooth boundaries in various senses as shown
in [11, 14] or [6].

4. Proofs of the main results. 

4.1. The upper bound: proof of Theorem 2. Since $S$ satisfies (M), we
notice that, by dominated convergence, for every $t\in S$, considering the sequence
$(t_k)$ provided by condition (M), one has $P(\gamma(\cdot,t_k))$ that tends to
$P(\gamma(\cdot,t))$ as $k$ tends to infinity. Hence, $\ell(s,S)=\ell(s,S')$, which implies that
there exists some point $\pi(s)$ (which of course may depend on $\varepsilon_*$) such that
$\pi(s)\in S'$ and

(44) $\ell(s,\pi(s))\le\ell(s,S)+\varepsilon_*^2$.

We start from the identity

$$\ell(s,\hat s)=\ell(s,\pi(s))+\gamma_n(\hat s)-\gamma_n(\pi(s))+\bar\gamma_n(\pi(s))-\bar\gamma_n(\hat s),$$

which, by definition of $\hat s$, implies that

$$\ell(s,\hat s)\le\rho+\ell(s,\pi(s))+\bar\gamma_n(\pi(s))-\bar\gamma_n(\hat s).$$

Let $x=\sqrt{\kappa'y}\,\varepsilon_*$, where $\kappa'$ is a constant to be chosen later such that $\kappa'\ge1$
and

$$V_x=\sup_{t\in S}\frac{\bar\gamma_n(\pi(s))-\bar\gamma_n(t)}{\ell(s,t)+\varepsilon_*^2+x^2}.$$

Then,

$$\ell(s,\hat s)\le\rho+\ell(s,\pi(s))+V_x(\ell(s,\hat s)+\varepsilon_*^2+x^2)$$

and therefore, on the event $V_x<1/2$, one has

$$\ell(s,\hat s)<2(\rho+\ell(s,\pi(s)))+\varepsilon_*^2+x^2,$$

yielding

(45) $\mathbb{P}[\ell(s,\hat s)\ge2(\rho+\ell(s,S))+3\varepsilon_*^2+x^2]\le\mathbb{P}[V_x\ge\tfrac12]$.

Since $\ell$ is bounded by 1, we may always assume $x$ (and thus $\varepsilon_*$) to be not
larger than 1. Assuming that $x\le1$, it remains to control the variable $V_x$ via
Bousquet's inequality. In order to use Bousquet's inequality, we first remark
that, by assumption (M),

$$V_x=\sup_{t\in S'}\frac{\bar\gamma_n(\pi(s))-\bar\gamma_n(t)}{\ell(s,t)+\varepsilon_*^2+x^2},$$

which means that we indeed have to deal with a countably indexed empirical
process. Note that the triangle inequality implies via (16), (44) and (19) that

(46) $(\mathrm{Var}_P[\gamma(t,\cdot)-\gamma(\pi(s),\cdot)])^{1/2}\le d(s,t)+d(s,\pi(s))\le2w\bigl(\sqrt{\ell(s,t)+\varepsilon_*^2}\bigr)$.

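Since $\gamma$ takes its values in $[0,1]$, introducing the function $w_1=1\wedge2w$, we
derive from (46) that

$$\sup_{t\in S}\mathrm{Var}_P\Bigl[\frac{\gamma(t,\cdot)-\gamma(\pi(s),\cdot)}{\ell(s,t)+\varepsilon_*^2+x^2}\Bigr]\le\sup_{\varepsilon\ge0}\frac{w_1^2(\varepsilon)}{(\varepsilon^2+x^2)^2}\le\frac{1}{x^2}\sup_{\varepsilon\ge0}\Bigl(\frac{w_1(\varepsilon)}{\varepsilon\vee x}\Bigr)^2.$$

Now the monotonicity assumptions on $w$ imply that either $w(\varepsilon)\le w(x)$ if
$x\ge\varepsilon$ or $w(\varepsilon)/\varepsilon\le w(x)/x$ if $x\le\varepsilon$. Hence, one has in any case $w(\varepsilon)/(\varepsilon\vee x)\le w(x)/x$,
which finally yields

$$\sup_{t\in S}\mathrm{Var}_P\Bigl[\frac{\gamma(t,\cdot)-\gamma(\pi(s),\cdot)}{\ell(s,t)+\varepsilon_*^2+x^2}\Bigr]\le\frac{w_1^2(x)}{x^4}.$$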
On the other hand, since $\gamma$ takes its values in $[0,1]$, we have

$$\sup_{t\in S}\Bigl\|\frac{\gamma(t,\cdot)-\gamma(\pi(s),\cdot)}{\ell(s,t)+\varepsilon_*^2+x^2}\Bigr\|_\infty\le\frac{1}{x^2}.$$

We can therefore apply (18) with $v=w_1^2(x)x^{-4}$ and $b=x^{-2}$, which gives
that, on a set $\Omega_y$ with probability larger than $1-\exp(-y)$, the inequality



rr ^rx.n 2(w^(x)x-^ +4E[VJ)y y 

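Now since $\varepsilon_*$ is assumed to be not larger than 1, one has $w(\varepsilon_*)\ge\varepsilon_*$ and
therefore, for every $\sigma\ge w(\varepsilon_*)$, the following inequality derives from the definition
of $\varepsilon_*$ by monotonicity:

$$\frac{\phi(\sigma)}{\sigma^2}\le\frac{\phi(w(\varepsilon_*))}{w^2(\varepsilon_*)}\le\frac{\phi(w(\varepsilon_*))}{\varepsilon_*^2}=\sqrt n.$$

Thus, (20) holds for every $\sigma\ge w(\varepsilon_*)$. In order to control $\mathbb{E}[V_x]$, we intend
to use Lemma A.5. For every $t\in S'$, we introduce $a^2(t)=\ell(s,\pi(s))\vee\ell(s,t)$.
Then by (44), $\ell(s,t)\le a^2(t)\le\ell(s,t)+\varepsilon_*^2$. Hence, we have, on the one hand,
that

$$\mathbb{E}[V_x]\le\mathbb{E}\Bigl[\sup_{t\in S'}\frac{\bar\gamma_n(\pi(s))-\bar\gamma_n(t)}{a^2(t)+x^2}\Bigr]$$

and, on the other hand, that, for every $\varepsilon>0$,

$$\mathbb{E}\Bigl[\sup_{t\in S',\,a(t)\le\varepsilon}(\bar\gamma_n(\pi(s))-\bar\gamma_n(t))\Bigr]\le\mathbb{E}\Bigl[\sup_{t\in S',\,\ell(s,t)\le\varepsilon^2}(\bar\gamma_n(\pi(s))-\bar\gamma_n(t))\Bigr].$$

Now by (44) if there exists some $t\in S'$ such that $\ell(s,t)\le\varepsilon^2$, then $\ell(s,\pi(s))\le\varepsilon^2+\varepsilon_*^2\le2\varepsilon^2$
for every $\varepsilon\ge\varepsilon_*$ and therefore, by assumption (19) and monotonicity of $\theta\to w(\theta)/\theta$,
$d(\pi(s),t)\le2w(\varepsilon\sqrt2)\le2\sqrt2\,w(\varepsilon)$. Thus, we derive from (20) that,
for every $\varepsilon\ge\varepsilon_*$,

$$\sqrt n\,\mathbb{E}\Bigl[\sup_{t\in S',\,\ell(s,t)\le\varepsilon^2}(\bar\gamma_n(\pi(s))-\bar\gamma_n(t))\Bigr]\le\phi(2\sqrt2\,w(\varepsilon))$$

and since $\theta\to\phi(2\sqrt2\,w(\theta))/\theta$ is nonincreasing, we can use Lemma A.5 to get

$$\mathbb{E}[V_x]\le4\phi(2\sqrt2\,w(x))/(\sqrt n\,x^2),$$

and by monotonicity of $\theta\to\phi(\theta)/\theta$,

$$\mathbb{E}[V_x]\le8\sqrt2\,\phi(w(x))/(\sqrt n\,x^2).$$

Thus, using the monotonicity of $\theta\to\phi(w(\theta))/\theta$, and the definition of $\varepsilon_*$, we
derive that

(48) $\mathbb{E}[V_x]\le\dfrac{8\sqrt2\,\phi(w(\varepsilon_*))}{\sqrt n\,x\,\varepsilon_*}=\dfrac{8\sqrt2\,\varepsilon_*}{x}$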
provided that $x\ge\varepsilon_*$, which holds since $\kappa'\ge1$. Now, the monotonicity of
$\theta\to w_1(\theta)/\theta$ implies that $x^{-2}w_1^2(x)\le\varepsilon_*^{-2}w_1^2(\varepsilon_*)$, but since $\phi(\theta)/\theta\ge\phi(1)\ge1$
for every $\theta\in[0,1]$, we derive from (21) and the monotonicity of $\phi$ and
$\theta\to\phi(\theta)/\theta$ that

$$w_1^2(\varepsilon_*)\le4\phi^2(w(\varepsilon_*))=4n\varepsilon_*^4$$

and, therefore, $x^{-2}w_1^2(x)\le4n\varepsilon_*^2$. Plugging this inequality together with (48)
into (47) implies that, on the set $\Omega_y$,



It remains to replace $x^2$ by its value $\kappa'y\varepsilon_*^2$ to derive that, on the set $\Omega_y$, the
following inequality holds:



Taking into account that $\phi(w(\theta))\ge\phi(1\wedge w(\theta))\ge\theta$ for every $\theta\in[0,1]$, we
deduce from the definition of $\varepsilon_*$ that $n\varepsilon_*^2\ge1$ and, therefore, the preceding
inequality becomes, on $\Omega_y$,



Hence, choosing $\kappa'$ as a large enough numerical constant warrants that $V_x<1/2$
on $\Omega_y$ and, therefore, (45) yields

$$\mathbb{P}[\ell(s,\hat s)\ge2(\rho+\ell(s,S))+x^2+3\varepsilon_*^2]\le\mathbb{P}(\Omega_y^c)\le e^{-y}.$$

We get the required probability bound (22) by setting $\kappa=\kappa'+3$. The proof
can then be easily completed by integrating the tail bound (22) to derive
the required upper bound on the expected risk.

4.2. Lower bounds. To prove our various lower bounds, we shall use some
particular collections of probability distributions $\{P_t,t\in T\}$ for the random
pair $(X,Y)$ satisfying the margin condition (5). The purpose of the next
lemma is to compute the Kullback-Leibler information and the Hellinger
distance between pairs of distributions belonging to such a collection.

Lemma 7. Let $h\in[0,1]$, $\mu$ be a probability measure on $\mathcal{X}$ and $T$ be
a collection of classifiers on $\mathcal{X}$. Let $(X,Y)$ be the coordinate mappings on
$\mathcal{X}\times\{0,1\}$, and for every $t\in T$, define $P_t$ to be the probability distribution
on $\mathcal{X}\times\{0,1\}$ such that, under $P_t$, $X$ has distribution $\mu$ and for every $x\in\mathcal{X}$,
$Y$ follows conditionally on $X=x$ a Bernoulli distribution with parameter
$\eta_t(x)$. Assume that, for some partition $\mathcal{X}=\mathcal{X}_1\cup\mathcal{X}_2$, one has $\eta_t(x)=(1+(2t(x)-1)h)/2$
for every $x\in\mathcal{X}_1$ and $\eta_t(x)=t(x)=0$ for every $x\in\mathcal{X}_2$.
Denoting by $\|\cdot\|_1$ the $L_1(\mu)$-norm, for every $s,t\in T$, the square Hellinger
distance between $P_s$ and $P_t$ is given by

(49) $\mathcal{H}^2(P_t,P_s)=(1-\sqrt{1-h^2})\|t-s\|_1$,

while, if $h<1$, the Kullback-Leibler information between $P_s$ and $P_t$ is given
by

(50) $\mathcal{K}(P_t,P_s)=h\log\Bigl(\dfrac{1+h}{1-h}\Bigr)\|t-s\|_1$.

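Proof. For every $p\in[0,1]$, let us denote by $\mathcal{B}(p)$ the Bernoulli distribution
with parameter $p$. Then

$$\mathcal{H}^2(\mathcal{B}(p),\mathcal{B}(1-p))=\mathcal{H}^2(\mathcal{B}(1-p),\mathcal{B}(p))=1-2\sqrt{p(1-p)},$$

while, if $p\in(0,1)$,

$$\mathcal{K}(\mathcal{B}(p),\mathcal{B}(1-p))=\mathcal{K}(\mathcal{B}(1-p),\mathcal{B}(p))=(1-2p)\log\Bigl(\frac{1-p}{p}\Bigr).$$

Setting $p=(1+h)/2$, the point is that, whenever $t(x)\ne s(x)$, either $\eta_t(x)=p$
or $\eta_t(x)=1-p$, with $\eta_s(x)=1-\eta_t(x)$. Hence,

$$\begin{aligned}\mathcal{H}^2(P_t,P_s)&=\int\mathcal{H}^2(\mathcal{B}(\eta_t(x)),\mathcal{B}(\eta_s(x)))1_{t(x)\ne s(x)}\,d\mu(x)\\&=\|t-s\|_1\mathcal{H}^2(\mathcal{B}(p),\mathcal{B}(1-p))\\&=\|t-s\|_1(1-2\sqrt{p(1-p)}),\end{aligned}$$

which leads to (49). Similarly, one has

$$\begin{aligned}\mathcal{K}(P_t,P_s)&=\int\mathcal{K}(\mathcal{B}(\eta_t(x)),\mathcal{B}(\eta_s(x)))1_{t(x)\ne s(x)}\,d\mu(x)\\&=\|t-s\|_1\mathcal{K}(\mathcal{B}(p),\mathcal{B}(1-p))\\&=\|t-s\|_1(1-2p)\log\Bigl(\frac{1-p}{p}\Bigr),\end{aligned}$$

which leads to (50), giving the proof of the lemma. □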
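
The two Bernoulli identities behind (49) and (50) are easy to confirm numerically. The sketch below assumes the normalisation of the squared Hellinger distance that makes it equal to one minus the Bhattacharyya affinity (the one consistent with the formula $1-2\sqrt{p(1-p)}$ above); the function names are hypothetical.

```python
import math


def hellinger_sq(p, q):
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q),
    normalised as 1 - sum(sqrt(p_i * q_i)) (assumed convention)."""
    return 1.0 - (math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q)))


def kullback(p, q):
    """Kullback-Leibler information of Bernoulli(p) relative to Bernoulli(q), p, q in (0, 1)."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lemma7_constants(h):
    """Per-unit-of-L1-distance constants with p = (1 + h) / 2, for 0 <= h < 1."""
    p = (1.0 + h) / 2.0
    return hellinger_sq(p, 1.0 - p), kullback(p, 1.0 - p)


if __name__ == "__main__":
    for h in (0.1, 0.5, 0.9):
        print(h, lemma7_constants(h))
```

For each $h$ the two returned values agree with $1-\sqrt{1-h^2}$ and $h\log((1+h)/(1-h))$ respectively, up to floating-point error.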



Let us now turn to the proof of the lower bound which holds for general
VC-classes.

4.2.1. Proof of Theorem 4.

Proof. Since $\mathcal{A}$ is a VC-class with dimension $V$, there exists some
set with cardinality $V$ which is shattered by $\mathcal{A}$. Denoting such a set by
$\{x_1,x_2,\dots,x_V\}$, we consider the probability distribution $\mu$ supported by
$\{x_1,x_2,\dots,x_V\}$ and defined by

(51) $\mu(x_i)=p$ for $1\le i\le V-1$, $\quad\mu(x_V)=1-p(V-1)$,

with $p$ being some nonnegative parameter satisfying $p(V-1)\le1$ and to be
chosen later. Now for every element $b$ of the hyper-cube $\{0,1\}^{V-1}$, let

$$\eta_b(x_i)=\tfrac12(1+(2b_i-1)h)\ \text{for all }1\le i\le V-1,\qquad\eta_b(x_V)=0$$

and define $P_b$ as the joint distribution on $\mathcal{X}\times\{0,1\}$ such that, under $P_b$,
$X$ has distribution $\mu$ and $Y$ given $X=x_i$ has a Bernoulli distribution with
parameter $\eta_b(x_i)$, for every $1\le i\le V$. The corresponding Bayes classifier $s_b^*$
is given by $s_b^*(x_i)=b_i$ for $1\le i\le V-1$ and $s_b^*(x_V)=0$. Since $\{x_1,x_2,\dots,x_V\}$
is shattered by $\mathcal{A}$, we see that $P_b\in\mathcal{P}(h,S)$ for every $b\in\{0,1\}^{V-1}$. The
first step is to relate the minimax risk over $\mathcal{P}(h,S)$ to the minimax risk
over the finite subfamily $\{P_b,b\in\{0,1\}^{V-1}\}$ of $\mathcal{P}(h,S)$. Given any classifier
$t:\mathcal{X}\to\{0,1\}$, we have $\ell(s_b^*,t)=\mathbb{E}_b[|2\eta_b(X)-1||t(X)-s_b^*(X)|]\ge h\|t-s_b^*\|_1$
and, therefore,

$$R_n(h,S)\ge h\inf_{\hat s\in S}\sup_{b\in\{0,1\}^{V-1}}\mathbb{E}_b[\|s_b^*-\hat s\|_1].$$

Now given an estimator $\hat s$ taking its values in $S$, we can define $\hat b$ taking its
values in $\{0,1\}^{V-1}$ such that

$$\min_{b'\in\{0,1\}^{V-1}}\|s_{b'}^*-\hat s\|_1=\|s_{\hat b}^*-\hat s\|_1.$$

Hence, by the triangle inequality,

$$\|s_b^*-s_{\hat b}^*\|_1\le\|s_b^*-\hat s\|_1+\|s_{\hat b}^*-\hat s\|_1\le2\|s_b^*-\hat s\|_1,$$

which leads to

$$R_n(h,S)\ge\frac h2\inf_{\hat b\in\{0,1\}^{V-1}}\sup_{b\in\{0,1\}^{V-1}}\mathbb{E}_b[\|s_b^*-s_{\hat b}^*\|_1].$$

Moreover, from Lemma 7, for every pair of elements $b,b'$ of the hyper-cube
$\{0,1\}^{V-1}$, one has

$$\mathcal{H}^2(P_b,P_{b'})=(1-\sqrt{1-h^2})\|s_b^*-s_{b'}^*\|_1=p(1-\sqrt{1-h^2})\sum_{i=1}^{V-1}1_{b_i\ne b_i'}.$$

We are now in position to apply Assouad's lemma as stated in [1], for
instance. This gives $4R_n(h,S)\ge(V-1)ph[1-\sqrt{2n\Theta}]$, where $\Theta=p(1-\sqrt{1-h^2})$.
Since $1-\sqrt{1-h^2}\le h^2$, choosing $p=2/(9nh^2)$ implies that $\sqrt{2n\Theta}\le2/3$ and, therefore,

$$R_n(h,S)\ge\frac{V-1}{54nh},$$

at least if our choice of $p$ satisfies the constraint $p(V-1)\le1$, which a
fortiori holds whenever $h\ge\sqrt{(V-1)/n}$. It remains to notice that if $h<\sqrt{(V-1)/n}$,
we can use the preceding construction with $\bar h=\sqrt{(V-1)/n}$
instead of $h$. Of course, the corresponding family $\{P_b,b\in\{0,1\}^{V-1}\}$ is included
in $\mathcal{P}(\bar h,S)$, but also in $\mathcal{P}(h,S)$ as well, which means that in this case
$R_n(h,S)\ge(V-1)/(54n\bar h)$, completing the proof of the result. □

We turn now to the proof of a refined lower bound for classes of sets
satisfying $(A_{N,D})$ for every $N\ge4D$. Fano's lemma is one of the classical
tools used to build minimax lower bounds. We would rather use the following
very convenient bound for multiple testing due to Birgé (see [2]), which has
the advantage of being relevant even when testing only two hypotheses.

Lemma 8. Let $N\ge1$, $(P_i)_{0\le i\le N}$ be a family of probability distributions
and $(A_i)_{0\le i\le N}$ be a family of disjoint events. Let $a=\min_{0\le i\le N}P_i(A_i)$.
Then, setting $\bar{\mathcal{K}}=N^{-1}\sum_{i=1}^N\mathcal{K}(P_i,P_0)$,

(52) $a\le0.71\vee\dfrac{\bar{\mathcal{K}}}{\ln(1+N)}$.



4.2.2. Proof of Theorem 5. The basic construction is very similar to the 
one performed in the proof of Theorem 4 except that, given some integer 
N > 4:D to be chosen later, we focus on a particular subset of the hyper-cube 
{0, 1}^ instead of the hyper-cube itself. We consider the uniform probability 
distribution fi on the set {xi,X2, ■ ■ ■ ,xn} provided by assumption (Atv.d)- 
Moreover, setting 

{0,l}^ = |6G{0,l}^,f;6, = Z5|, 

we introduce for every element b of {0,1}^, rih{xi) = ^(1 + (26j — l)h) for 
all 1 < i < A^ and define Pb as the joint distribution on ^YxjO, 1} such that, 
under Pf,, X has distribution /i and Y given X = xi has a Bernoulli distri- 
bution with parameter r]i,(xi), for every 1 < i < A^. The corresponding Bayes 
classifier si is given by s^(xi) = bi for 1 < i < A^. Since {xi,X2, ■ ■ ■ ,xn} is 
the set provided by assumption {An,d)i we see that Pb £ V{h,S) for every 
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b G {0, 1}^. Arguing exactly as in the proof of Theorem 4, we notice that, 
for any classifier t, i{sl,t) > h\\t — s^||i, from which we derive that, for any 
subset C of {0; 1}^, the following lower bound holds: 

Rn{h,S)>^misnpEh[\\sl-sl\\i]. 

^ b€C b€C ° 



1 /A \ 1 



Since for every b, b' G {0, 1}^, 

\\4-4'\\i = j^[j:t,^^b:j = j^s{b,b'), 

where 5 denotes Hamming distance on {0,1}^, one has, for any subset C 
of {0,1}^, 

(53) Rn{h,S)>^misnpEh[5{b,b)], 

beC b€C 

and it remains to construct a set C with maximal cardinality such that the 
points of C are mutually sufficiently distant (w.r.t. the Hamming distance). 
This can be done thanks to a combinatorial argument due to Birge and 
Massart (see [17]). We more precisely use the version of it to be found in 
[20] and which is more convenient for our needs here. So by Lemma 8 in [20], 
since N > 4:D, we can choose C in such a way that 

6{b, b') > D/2, for every b, b' in C with b / b' , 

(54) 



log(#C) > pD log , where p = 0.233. 

For this choice of C, (53) leads to 

Rn{h, S)>^ inf rnaxPfc[6 / 6] = ^ inf (l - minP,[6 = b]) . 

4iV f'eC 4iV bgcV beC / 

We derive from (52) that, given a point bo £ C, for any estimator b, the 
following upper bound holds: 

(55) minPb(6 = 6) <a V 



bee '- log(#C)' 

where a = 0.71 and 

^ b£C,bjtb0 ^ b€C,bytb0 

For any b G {0, 1}^, we have d{b, bo) < 2D, and thanks to Lemma 7, since 

1 



4o\\i = —Ab^ bo), 
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we derive that 

— 2Dnh /l + h\ ADh^n 

Combining this inequaUty with (55) leads to 

(56) R,^n.S)>'^p^. 

provided that ADh^n < alog{^C){l — h)N, which via (54) a fortiori holds if 

(57) -—<mom/D). 

It remains to define $N$ in such a way that $4D \leq N$ and (57) holds. Setting

$$N = \lfloor a \rfloor \quad\text{with}\quad a = \frac{8nh^2}{(1-h)\rho\alpha(1+\log(nh^2/D))},$$

since $N \leq a$, (56) leads to the desired lower bound (41) with $c = \rho\alpha(1-\alpha)/32$,
at least if the constraints $4D \leq N$ and (57) are satisfied. Let us first
prove that $a \geq 42D$ (which a fortiori implies that $N \geq 4D$). Indeed, since
$x \to x(1+\log(x))^{-1}$ increases on $[1,+\infty)$ and $nh^2 \geq D$, we derive from the
definition of $a$ that $a/D \geq 8/(\rho\alpha) \geq 42$ and, therefore, on the one hand, the
constraint $N \geq 4D$ is satisfied and, on the other hand,

(58)  $\frac{N}{a} \geq \frac{a-1}{a} \geq \frac{41}{42}.$

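Now, let us notice that, for $\theta = 21/41$, one has

$$1 + \log(x) \leq x^{1-\theta} \quad\text{for } x \geq 41$$

[which, by monotonicity, amounts to checking numerically that $1+\log(x) \leq x^{1-\theta}$
at point $x = 41$] or, equivalently, $\log(x/(1+\log(x))) \geq \theta\log(x)$ for $x \geq 41$.
Applying this inequality with

$$x = \frac{4nh^2}{\theta\rho\alpha D} \geq \frac{4}{\theta\rho\alpha} \geq 41$$

leads, by definition of $a$, to

$$\log\Bigl(\frac{a}{2\theta D}\Bigr) \geq \log\Bigl(\frac{x}{1+\log(x)}\Bigr) \geq \theta\log(x) \geq \theta(1+\log(nh^2/D)).$$

Hence, since (58) means that $N \geq a/(2\theta)$, we get

$$N\log(N/D) \geq \frac{a}{2\theta}\log\Bigl(\frac{a}{2\theta D}\Bigr) \geq \frac{a}{2}(1+\log(nh^2/D))$$

and, therefore, (57) holds, completing the proof of Theorem 5.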
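The numerical facts invoked at the end of this proof can be checked directly. The sketch below uses the constants 0.233 and 0.71 quoted in (54) and (55), and takes the exponent to be 21/41 (written THETA here, a symbol name assumed for illustration). It checks that 8/(ρα) is at least 42, that 4/(θρα) is at least 41, and that 1 + log(x) does not exceed x^(1-θ) at x = 41, together with the sign of the derivative that gives monotonicity.

```python
import math

RHO = 0.233      # constant rho from (54)
ALPHA = 0.71     # constant alpha from (55)
THETA = 21 / 41  # exponent used in the last step of the proof

# a / D is bounded below by 8 / (rho * alpha); the proof needs this to be >= 42.
ratio_a = 8 / (RHO * ALPHA)
# x is bounded below by 4 / (theta * rho * alpha); the proof needs this to be >= 41.
ratio_x = 4 / (THETA * RHO * ALPHA)


def gap(x):
    """x^(1 - theta) - (1 + log x); a nonnegative value means the inequality holds."""
    return x ** (1 - THETA) - (1 + math.log(x))


def gap_derivative(x):
    """Derivative of gap; its sign is that of (1 - theta) * x^(1 - theta) - 1."""
    return (1 - THETA) * x ** (-THETA) - 1 / x


print(ratio_a, ratio_x, gap(41.0), gap_derivative(41.0))
```

Since (1 - θ) x^(1-θ) is increasing in x, a nonnegative derivative at x = 41 implies that the gap is nondecreasing on [41, +∞), which is the monotonicity argument used in the text.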
4.2.3. Proof of Theorem 6. For every classifier $t \in S$, we set $\eta_t = (1 +
(2t-1)h)/2$ and define $P_t$ as the joint distribution on $\mathcal{X}\times\{0,1\}$ such that,
under $P_t$, $X$ has distribution $\mu$ and $Y$ given $X = x$ has a Bernoulli distribution
with parameter $\eta_t(x)$, for every $x \in \mathcal{X}$. By definition of $P_t$, $t$ is the Bayes
classifier related to $P_t$. This shows that the collection $\{P_t, t \in S\}$ is included
in $\mathcal{P}(h,S)$. Arguing as in the preceding proofs of lower bounds, since, for
every classifier $t \in S$, $\ell(s,t) \geq h\|t - s\|_1$, for any finite subset $\mathcal{C}$ of $S$,
the following lower bound is available:

$$R_n(h,S) \geq \frac{h}{2}\inf_{\hat{s}\in\mathcal{C}}\sup_{s\in\mathcal{C}} \mathbb{E}_s[\|s - \hat{s}\|_1].$$

We now use an argument due to Yang and Barron [25]. Given $\varepsilon > 0$, the idea
is to construct an $\varepsilon$-net (i.e., a maximal set of points such that the mutual
distances between the elements of this net stay of order $\varepsilon$, less than or equal to
$2C\varepsilon$, say, for some constant $C > 1$). To do this, we consider an $\varepsilon$-net $\mathcal{C}'$ and
a $C\varepsilon$-net $\mathcal{C}''$ of $S$ with respect to the $L_1(\mu)$-distance. Any point of $\mathcal{C}'$ must
belong to some ball with radius $C\varepsilon$ centered at some point of $\mathcal{C}''$. Hence, if
$\mathcal{C}$ denotes an intersection of $\mathcal{C}'$ with such a ball with maximal cardinality,
one has, for every $t, t' \in \mathcal{C}$ with $t \neq t'$,

(59)  $\varepsilon \leq \|t - t'\|_1 \leq 2C\varepsilon$

and

(60)  $\log(\#\mathcal{C}) \geq H_1(\varepsilon,S,\mu) - H_1(C\varepsilon,S,\mu).$

Hence, $R_n(h,S) \geq (h\varepsilon/2)\inf_{\hat{s}\in\mathcal{C}}(1 - \inf_{s\in\mathcal{C}} P_s(\hat{s} = s))$ and using again Lemma 8,
we derive that $R_n(h,S) \geq (h\varepsilon/2)(1-\alpha)$, provided that $\mathcal{K} \leq \alpha\log(\#\mathcal{C})$, where,
given some arbitrary point $t_0$ in $S$,

Thanks to Lemma 7, we know that 

h^ 




}C <2n \ sup ||t — tolli 



< 8n 



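Now, using our assumption on the behavior of $H_1(\cdot,S,\mu)$, we easily derive
from (60) that, properly choosing $C$, for some positive constant $C_1$ (depending
on $K_1$, $K_2$ and $r$), one has, for every $\varepsilon \leq \varepsilon_0$, $\log(\#\mathcal{C}) \geq C_1\varepsilon^{-r}$. Therefore,

$$\frac{\mathcal{K}}{\log(\#\mathcal{C})} \leq \frac{8n}{C_1}\Bigl(\frac{h^2}{1-h}\Bigr)\varepsilon^{1+r},$$

and we can conclude that $R_n(h,S) \geq (h\varepsilon/2)(1-\alpha)$ whenever

$$\frac{8n}{C_1}\Bigl(\frac{h^2}{1-h}\Bigr)\varepsilon^{1+r} \leq \alpha,$$

that is,

$$\varepsilon \leq \Bigl(\frac{\alpha C_1(1-h)}{8nh^2}\Bigr)^{1/(1+r)}.$$

We may always assume that $\alpha C_1/8 \leq \varepsilon_0^{1+r}$, so that choosing

$$\varepsilon = \Bigl(\frac{\alpha C_1(1-h)}{8nh^2}\Bigr)^{1/(1+r)},$$

the constraint $\varepsilon \leq \varepsilon_0$ is satisfied if we assume that $nh^2 \geq 1$, and we finally
get in this case

$$R_n(h,S) \geq \frac{1-\alpha}{2}\,h\Bigl(\frac{\alpha C_1(1-h)}{8nh^2}\Bigr)^{1/(1+r)}.$$

Otherwise, if $nh^2 < 1$, we can always use the preceding lower bound with
$n^{-1/2}$ in place of $h$, which (at least if $n \geq 2$) leads to $R_n(h,S) \geq C\sqrt{1/n}$,
completing the proof of the lower bound.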

APPENDIX: MAXIMAL INEQUALITIES 

Our purpose here is to provide maximal inequalities for set-indexed empirical
processes under either the VC-condition or an entropy with bracketing
assumption, and also for weighted processes under local conditions.
Although these inequalities are essentially well known, we have not always
found them explicitly stated in the literature in a way which was satisfactory
for our needs. This is the reason why we have decided to briefly recall
what these results are and how they can be proved, our feeling being
that it could make life easier for a reader who is not familiar with empirical
process techniques.

A.1. Random vectors and Rademacher processes.

A.1.1. Random vectors. We recall a simple maximal inequality for random
vectors which easily follows from an argument due to Pisier (see [17]).
This inequality turns out to be extremely useful for deriving chaining bounds
for either sub-Gaussian or empirical processes.

Lemma A.1. Let $(Z_f)_{f\in\mathcal{F}}$ be a finite family of real-valued random variables.
Let $\psi$ be a convex and continuously differentiable function on $[0,b)$
with $0 < b \leq +\infty$. Assume that $\psi(0) = \psi'(0) = 0$ and set, for every $x \geq 0$,

$$\psi^*(x) = \sup_{\lambda\in(0,b)}(\lambda x - \psi(\lambda)).$$

If for every $\lambda \in (0,b)$ and $f \in \mathcal{F}$, one has

(A.1)  $\log\mathbb{E}[\exp(\lambda Z_f)] \leq \psi(\lambda),$

then, if $N$ denotes the cardinality of $\mathcal{F}$, we have

$$\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} Z_f\Bigr] \leq \psi^{*-1}(\log(N)).$$

In particular, if for some nonnegative number $v$, one has $\psi(\lambda) = \lambda^2v/2$ for
every $\lambda \in (0,+\infty)$, then

(A.2)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} Z_f\Bigr] \leq \sqrt{2v\log(N)},$

while, if $\psi(\lambda) = \lambda^2v/(2(1-c\lambda))$ for every $\lambda \in (0,1/c)$, one has

$$\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} Z_f\Bigr] \leq \sqrt{2v\log(N)} + c\log(N).$$
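The two closed forms in Lemma A.1 can be checked numerically through the standard identity expressing the inverse of the Legendre transform as the infimum over λ in (0,b) of (y + ψ(λ))/λ. The Python sketch below approximates this infimum on a grid; the function name, the grid size, the truncation of (0,+∞) to (0,50) and the values of v, c and y are illustrative assumptions.

```python
import math


def psi_star_inverse(y, psi, lam_max, grid=200_000):
    """Approximate inf over 0 < lam < lam_max of (y + psi(lam)) / lam on a grid."""
    best = math.inf
    for k in range(1, grid):
        lam = lam_max * k / grid
        best = min(best, (y + psi(lam)) / lam)
    return best


v, c, y = 2.0, 0.5, 3.0  # illustrative values; y plays the role of log(N)

# Sub-Gaussian case: psi(lam) = lam^2 v / 2, with (0, +inf) truncated to (0, 50).
gauss = psi_star_inverse(y, lambda lam: lam * lam * v / 2, lam_max=50.0)
# Bernstein-type case: psi(lam) = lam^2 v / (2 (1 - c lam)) on (0, 1/c).
gamma = psi_star_inverse(
    y, lambda lam: lam * lam * v / (2 * (1 - c * lam)), lam_max=1 / c
)

print(gauss, math.sqrt(2 * v * y))
print(gamma, math.sqrt(2 * v * y) + c * y)
```

Each printed pair should agree up to the grid resolution, the grid value being slightly larger since it is an infimum over fewer points.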

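The two situations where we shall apply this lemma in order to derive
chaining bounds are the following:

• $\mathcal{F}$ is a finite subset of $\mathbb{R}^n$ and $Z_f = \sum_{i=1}^n \varepsilon_i f_i$, where $(\varepsilon_1,\ldots,\varepsilon_n)$ are
independent Rademacher variables. Then, setting $v = \sup_{f\in\mathcal{F}}\sum_{i=1}^n f_i^2$, it is
well known that (A.1) is satisfied with $\psi(\lambda) = \lambda^2v/2$ and, therefore,

(A.3)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\sum_{i=1}^n \varepsilon_i f_i\Bigr] \leq \sqrt{2v\log(N)}.$

• $\mathcal{F}$ is a finite set of functions $f$ such that $\|f\|_\infty \leq 1$ and $Z_f = \sum_{i=1}^n f(\xi_i) -
\mathbb{E}[f(\xi_i)]$, where $\xi_1,\ldots,\xi_n$ are independent random variables. Then, setting
$v = \sup_{f\in\mathcal{F}}\sum_{i=1}^n \mathbb{E}[f^2(\xi_i)]$, as a by-product of the proof of Bernstein's
inequality (see [3]), assumption (A.1) is satisfied with $\psi(\lambda) = \lambda^2v/(2(1 -
\lambda/3))$ and, therefore,

(A.4)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} Z_f\Bigr] \leq \sqrt{2v\log(N)} + \frac{1}{3}\log(N).$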
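For a small finite class the expectation in (A.3) can be computed exactly by enumerating all sign patterns, which gives a direct sanity check of the bound. The class F below is a hypothetical example chosen only for illustration, and the helper name is an assumption.

```python
import math
from itertools import product


def expected_sup(vectors):
    """Exact E[max_f sum_i eps_i f_i] over all 2^n equally likely sign patterns."""
    n = len(vectors[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += max(sum(e * x for e, x in zip(signs, f)) for f in vectors)
    return total / 2 ** n


# Hypothetical finite class: four vectors of R^6 chosen for illustration only.
F = [
    (1, 0, 1, 0, 1, 0),
    (0, 1, 1, 1, 0, 0),
    (1, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1),
]
v = max(sum(x * x for x in f) for f in F)    # v = sup_f sum_i f_i^2
bound = math.sqrt(2 * v * math.log(len(F)))  # right-hand side of (A.3)
lhs = expected_sup(F)
print(lhs, bound)
```

The enumeration costs 2^n evaluations, so this check is only practical for small n; it is meant as an illustration of the statement, not as a way to compute such suprema in general.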



We are now ready to prove a maximal inequality for Rademacher processes 
which will be useful for analyzing symmetrized empirical processes. 

A.1.2. Rademacher processes. Let $\mathcal{F}$ be a bounded subset of $\mathbb{R}^n$ equipped
with the usual Euclidean norm defined by

$$\|z\|_2^2 = \sum_{i=1}^n z_i^2,$$

and let, for any positive $\delta$, $H_2(\delta,\mathcal{F})$ denote the logarithm of the maximal
number of points $\{f^{(1)},\ldots,f^{(N)}\}$ belonging to $\mathcal{F}$ such that $\|f^{(j)} - f^{(j')}\|_2^2 >
\delta^2$ for every $j \neq j'$. It is easy to derive from the maximal inequality (A.3)
the following chaining inequality which is quite standard (see [12]).

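Lemma A.2. Let $\mathcal{F}$ be a bounded subset of $\mathbb{R}^n$ and $(\varepsilon_1,\ldots,\varepsilon_n)$ be independent
Rademacher variables. We consider the Rademacher process $(Z_f)_{f\in\mathcal{F}}$
defined by $Z_f = \sum_{i=1}^n \varepsilon_i f_i$ for every $f \in \mathcal{F}$. Let $\delta$ be such that $\sup_{f\in\mathcal{F}}\|f\|_2 \leq
\delta$. Then

(A.5)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}} Z_f\Bigr] \leq 3\delta\sum_{j=0}^{\infty} 2^{-j}\sqrt{H_2(2^{-j-1}\delta,\mathcal{F})}.$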



The proof being straightforward, we skip it. The interested reader will 
find a detailed proof in [16]. 

We turn now to maximal inequalities for set-indexed empirical processes. 
The VC-case will be treated via symmetrization by using the preceding 
bounds for Rademacher processes, while the bracketing case will be studied 
via a convenient chaining argument. 

A.2. Empirical processes. Let us first fix some notation. Throughout
this section we consider i.i.d. random variables $\xi_1,\ldots,\xi_n$ with values in some
measurable space $\mathcal{Z}$ and common distribution $P$. For any $P$-integrable function
$f$ on $\mathcal{Z}$, we define $P_n(f) = n^{-1}\sum_{i=1}^n f(\xi_i)$ and $\nu_n(f) = P_n(f) - P(f)$.

Given a collection $\mathcal{F}$ of $P$-integrable functions $f$, our purpose is to control the
expectation of $\sup_{f\in\mathcal{F}}\nu_n(f)$ or $\sup_{f\in\mathcal{F}} -\nu_n(f)$, when either $\mathcal{F} = \{1_B, B \in \mathcal{B}\}$
and $\mathcal{B}$ is a VC-class or under an $L_1$-entropy with bracketing condition on $\mathcal{F}$.

A.2.1. VC-classes. In the VC-case the following result is a refinement of
what can be found in [15]. 

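Lemma A.3. Let $\mathcal{B}$ be a countable VC-class with dimension not larger
than $V \geq 1$ and assume that $\sigma > 0$ is such that

$$P(B) \leq \sigma^2 \quad\text{for every } B \in \mathcal{B}.$$

Let

$$W_{\mathcal{B}}^+ = \sup_{B\in\mathcal{B}}\nu_n(B), \qquad W_{\mathcal{B}}^- = \sup_{B\in\mathcal{B}} -\nu_n(B)$$

and

$$H_{\mathcal{B}} = \log\#\{B\cap\{\xi_1,\ldots,\xi_n\},\ B \in \mathcal{B}\}.$$

Then there exists an absolute constant $K$ such that

(A.6)  $\sqrt{n}\,(\mathbb{E}[W_{\mathcal{B}}^-] \vee \mathbb{E}[W_{\mathcal{B}}^+]) \leq \frac{K}{2}\sigma\sqrt{\mathbb{E}[H_{\mathcal{B}}]},$

provided that $\sigma \geq K\sqrt{\mathbb{E}[H_{\mathcal{B}}]/n}$, and

(A.7)  $\sqrt{n}\,(\mathbb{E}[W_{\mathcal{B}}^-] \vee \mathbb{E}[W_{\mathcal{B}}^+]) \leq \frac{K}{2}\sigma\sqrt{V(1+\log(\sigma^{-1}\vee 1))},$

provided that $\sigma \geq K\sqrt{V(1+|\log\sigma|)/n}$.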



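Proof. We use the following classical symmetrization device (see, e.g.,
[12]). Given independent random signs $(\varepsilon_1,\ldots,\varepsilon_n)$, independent of $(\xi_1,\ldots,\xi_n)$,
whatever the countable class of functions $\mathcal{F}$, the following inequality holds:

(A.8)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}(P_n - P)(f)\Bigr] \leq \frac{2}{n}\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\sum_{i=1}^n \varepsilon_i f(\xi_i)\Bigr].$

Applying this symmetrization inequality to the class $\mathcal{F} = \{1_B, B \in \mathcal{B}\}$ and
the sub-Gaussian inequalities for suprema of Rademacher processes (A.3) or
(A.5), setting $\delta_n^2 = [\sup_{B\in\mathcal{B}} P_n(B)] \vee \sigma^2$, we get either

(A.9)  $\mathbb{E}[W_{\mathcal{B}}^+] \leq 2\sqrt{\frac{2}{n}}\,\mathbb{E}\Bigl[\sqrt{H_{\mathcal{B}}\delta_n^2}\Bigr]$

or, if $H_{\mathrm{univ}}(\cdot,\mathcal{B})$ denotes the universal entropy of $\mathcal{B}$ as defined in Section 2.4.1,

(A.10)  $\mathbb{E}[W_{\mathcal{B}}^+] \leq \frac{6}{\sqrt{n}}\,\mathbb{E}\Bigl[\delta_n\sum_{j=0}^{\infty} 2^{-j}\sqrt{H_{\mathrm{univ}}(2^{-j-1}\delta_n,\mathcal{B})}\Bigr].$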



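Then by the Cauchy-Schwarz inequality, on the one hand, (A.9) becomes

$$\mathbb{E}[W_{\mathcal{B}}^+] \leq 2\sqrt{\frac{2}{n}}\sqrt{\mathbb{E}[H_{\mathcal{B}}]\,\mathbb{E}[\delta_n^2]},$$

so that

(A.11)  $\mathbb{E}[W_{\mathcal{B}}^+] \leq 2\sqrt{\frac{2}{n}}\sqrt{\mathbb{E}[H_{\mathcal{B}}](\sigma^2 + \mathbb{E}[W_{\mathcal{B}}^+])},$

and, on the other hand, since $H_{\mathrm{univ}}(\cdot,\mathcal{B})$ is nonincreasing, we derive from
(A.10) that

$$\mathbb{E}[W_{\mathcal{B}}^+] \leq \frac{6}{\sqrt{n}}\sqrt{\sigma^2 + \mathbb{E}[W_{\mathcal{B}}^+]}\sum_{j=0}^{\infty} 2^{-j}\sqrt{H_{\mathrm{univ}}(2^{-j-1}\sigma,\mathcal{B})}$$

so that, by Haussler's bound (26), one has

$$\mathbb{E}[W_{\mathcal{B}}^+] \leq 6\sqrt{\frac{\kappa V}{n}}\sqrt{\sigma^2 + \mathbb{E}[W_{\mathcal{B}}^+]}\sum_{j=0}^{\infty} 2^{-j}\sqrt{(j+1)\log(2) + \log(\sigma^{-1}\vee 1) + 1}.$$

Setting either $D = C^2\mathbb{E}[H_{\mathcal{B}}]$ or $D = C^2V(1+\log(\sigma^{-1}\vee 1))$, where $C$ is a
conveniently chosen absolute constant [$C = 2$ in the first case and $C =
6\sqrt{\kappa}(1+\sqrt{2})$ in the second case], the following inequality holds in both
cases:

$$(\mathbb{E}[W_{\mathcal{B}}^+])^2 \leq \frac{2D}{n}[\sigma^2 + \mathbb{E}[W_{\mathcal{B}}^+]]$$

or, equivalently,

$$\mathbb{E}[W_{\mathcal{B}}^+] \leq \frac{D}{n} + \sqrt{\frac{D}{n}}\sqrt{\frac{D}{n} + 2\sigma^2},$$

which, whenever $\sigma \geq 2\sqrt{3}\sqrt{D/n}$, implies that

(A.12)  $\mathbb{E}[W_{\mathcal{B}}^+] \leq \sqrt{3}\,\sigma\sqrt{\frac{D}{n}}.$

The control of $\mathbb{E}[W_{\mathcal{B}}^-]$ is very similar. This time we apply the symmetrization
inequality (A.8) to the class $\mathcal{F} = \{-1_B, B \in \mathcal{B}\}$ and derive by the same
arguments as above that

$$(\mathbb{E}[W_{\mathcal{B}}^-])^2 \leq \frac{2D}{n}[\sigma^2 + \mathbb{E}[W_{\mathcal{B}}^+]].$$

Hence, provided that $\sigma \geq 2\sqrt{3}\sqrt{D/n}$, (A.12) implies that $\mathbb{E}[W_{\mathcal{B}}^+] \leq \sigma^2/2$
which, in turn, yields $\mathbb{E}[W_{\mathcal{B}}^-] \leq \sqrt{3}\,\sigma\sqrt{D/n}$, completing the proof of the
lemma. □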
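The last step of this proof amounts to solving a quadratic inequality. As a sketch, assume the key inequality reads w^2 ≤ (2D/n)(σ^2 + w) with w standing for the expectation of the supremum (the displayed formula is damaged in our copy, so this form is an assumption, as are the function names and the numerical values). The code below computes the largest admissible w and compares it with √3 σ √(D/n) under the constraint σ ≥ 2√3 √(D/n).

```python
import math


def largest_root(d_over_n, sigma):
    """Largest w >= 0 satisfying w^2 <= 2 * d_over_n * (sigma^2 + w)."""
    a = d_over_n
    return a + math.sqrt(a * a + 2 * a * sigma * sigma)


def check(d_over_n, sigma):
    """Return (largest admissible w, bound sqrt(3) * sigma * sqrt(D / n))."""
    return largest_root(d_over_n, sigma), math.sqrt(3 * d_over_n) * sigma


# Boundary case sigma = 2 * sqrt(3) * sqrt(D / n), then a larger sigma.
d_over_n = 0.01
sigma_min = 2 * math.sqrt(3 * d_over_n)
at_boundary = check(d_over_n, sigma_min)
above_boundary = check(d_over_n, 3 * sigma_min)
print(at_boundary, above_boundary)
```

Under this reading the two quantities coincide at the boundary value of σ, which suggests that the constants 2√3 and √3 in the statement are matched to each other.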



The case of entropy with bracketing can be treated via some direct chain- 
ing argument. 

A.2.2. The entropy with bracketing assumption. We now prove a maximal
inequality via a classical chaining argument. Note that the same kind of
result would be valid for $L_2(P)$-entropy with bracketing conditions, but the
chaining argument would involve adaptive truncations which are not needed
for $L_1(P)$-entropy with bracketing. Since $L_1(P)$-entropy with bracketing will
suffice for our needs, for the sake of simplicity, we content ourselves with this
notion here.



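Lemma A.4. Let $\mathcal{F}$ be a countable collection of measurable functions
such that $0 \leq f \leq 1$ for every $f \in \mathcal{F}$, and let $f_0$ be a measurable function
such that $0 \leq f_0 \leq 1$. Let $\delta$ be a positive number such that $P(|f - f_0|) \leq \delta^2$
for every $f \in \mathcal{F}$ and assume $u \to H_{[\cdot]}^{1/2}(u^2,\mathcal{F},P)$ is integrable at 0. Then,
setting

$$\varphi(\delta) = \int_0^{\delta} H_{[\cdot]}^{1/2}(u^2,\mathcal{F},P)\,du,$$

the following inequality is available:

$$\sqrt{n}\Bigl(\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\nu_n(f_0 - f)\Bigr] \vee \mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\nu_n(f - f_0)\Bigr]\Bigr) \leq 12\varphi(\delta),$$

provided that $4\varphi(\delta) \leq \delta^2\sqrt{n}$.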



Proof. We first perform the control of $\mathbb{E}[\sup_{f\in\mathcal{F}}\nu_n(f_0 - f)]$. For any integer
$j$, we set $\delta_j = \delta 2^{-j}$ and $H_j = H_{[\cdot]}(\delta_j^2,\mathcal{F},P)$. By definition of $H_{[\cdot]}(\cdot,\mathcal{F},P)$,
for any integer $j \geq 1$, we can define a mapping $\Pi_j$ from $\mathcal{F}$ to some finite
collection of functions such that

(A.13)  $\log\#\Pi_j\mathcal{F} \leq H_j$

and

(A.14)  $\Pi_jf \leq f$ with $P(f - \Pi_jf) \leq \delta_j^2$ for all $f \in \mathcal{F}$.

For $j = 0$, we choose $\Pi_0$ to be identically equal to $f_0$. For this choice of $\Pi_0$,
we still have

(A.15)  $P(|f - \Pi_0f|) = P(|f - f_0|) \leq \delta_0^2 = \delta^2$

for every $f \in \mathcal{F}$. Furthermore, since we may always assume that the extremities
of the brackets used to cover $\mathcal{F}$ take their values in $[0,1]$, we also have
for every integer $j$ that

$$0 \leq \Pi_jf \leq 1.$$

Noticing that since $u \to H_{[\cdot]}(u^2,\mathcal{F},P)$ is nonincreasing,

$$H_1 \leq \delta_1^{-2}\varphi^2(\delta),$$

and under the condition $4\varphi(\delta) \leq \delta^2\sqrt{n}$, one has $H_1 \leq \delta_1^2n$. Thus, since $j \to
H_j\delta_j^{-2}$ increases to infinity, the set $\{j \geq 0: H_j \leq \delta_j^2n\}$ is a nonvoid interval
of the form

$$\{j \geq 0: H_j \leq \delta_j^2n\} = [0,J],$$

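with $J \geq 1$. For every $f \in \mathcal{F}$, starting from the decomposition

$$\nu_n(f_0) - \nu_n(f) = \sum_{j=0}^{J-1}[\nu_n(\Pi_jf) - \nu_n(\Pi_{j+1}f)] + \nu_n(\Pi_Jf) - \nu_n(f),$$

we derive, since $\Pi_J(f) \leq f$ and $P(f - \Pi_J(f)) \leq \delta_J^2$, that

$$\nu_n(f_0 - f) \leq \sum_{j=0}^{J-1}[\nu_n(\Pi_jf) - \nu_n(\Pi_{j+1}f)] + \delta_J^2$$

and, therefore,

(A.16)  $\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\nu_n(f_0 - f)\Bigr] \leq \sum_{j=0}^{J-1}\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}[\nu_n(\Pi_jf) - \nu_n(\Pi_{j+1}f)]\Bigr] + \delta_J^2.$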



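Now, it follows from (A.14) and (A.15) that, for every integer $j$ and every
$f \in \mathcal{F}$, one has

$$P[|\Pi_jf - \Pi_{j+1}f|] \leq \delta_j^2 + \delta_{j+1}^2 = 5\delta_{j+1}^2$$

and, therefore, since $|\Pi_jf - \Pi_{j+1}f| \leq 1$,

$$P[|\Pi_jf - \Pi_{j+1}f|^2] \leq 5\delta_{j+1}^2.$$

Moreover, (A.13) ensures that the number of functions of the form $\Pi_jf -
\Pi_{j+1}f$ when $f$ varies in $\mathcal{F}$ is not larger than $\exp(H_j + H_{j+1}) \leq \exp(2H_{j+1})$.
Hence, we derive from (A.4) that

$$\sqrt{n}\,\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}[\nu_n(\Pi_jf) - \nu_n(\Pi_{j+1}f)]\Bigr] \leq 2\Bigl[\delta_{j+1}\sqrt{5H_{j+1}} + \frac{1}{3\sqrt{n}}H_{j+1}\Bigr]$$

and (A.16) becomes

(A.17)  $\sqrt{n}\,\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\nu_n(f_0 - f)\Bigr] \leq 2\sum_{j=1}^{J}\Bigl[\delta_j\sqrt{5H_j} + \frac{1}{3\sqrt{n}}H_j\Bigr] + 4\sqrt{n}\,\delta_{J+1}^2.$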

It follows from the definition of $J$ that, on the one hand, for every $j \leq J$,

$$\frac{1}{3\sqrt{n}}H_j \leq \frac{1}{3}\delta_j\sqrt{H_j}$$

and, on the other hand,

$$4\sqrt{n}\,\delta_{J+1}^2 \leq 4\delta_{J+1}\sqrt{H_{J+1}}.$$

Hence, plugging these inequalities in (A.17) yields

$$\sqrt{n}\,\mathbb{E}\Bigl[\sup_{f\in\mathcal{F}}\nu_n(f_0 - f)\Bigr] \leq 6\sum_{j=1}^{J+1}\delta_j\sqrt{H_j}$$

and the result follows. The control of $\mathbb{E}[\sup_{f\in\mathcal{F}}\nu_n(f - f_0)]$ can be performed
analogously, changing lower into upper approximations in the dyadic
approximation scheme described above. □
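Two elementary facts close the chaining argument: the constant 2(√5 + 1/3) coming from (A.4) is at most 6, and the dyadic sum of δ_j √H_j is at most twice the entropy integral. The sketch below checks the first fact exactly and illustrates the second on a hypothetical entropy, taken here as 1/u at radius u^2 so that the integral of its square root over (0, δ) equals 2√δ. The entropy, the function names and the value of δ are assumptions made for illustration only.

```python
import math


def dyadic_sum(delta, entropy, terms=60):
    """Sum over j = 1..terms of delta_j * sqrt(H_j), with delta_j = delta * 2^-j."""
    total = 0.0
    for j in range(1, terms + 1):
        delta_j = delta * 2.0 ** (-j)
        total += delta_j * math.sqrt(entropy(delta_j))
    return total


def toy_entropy(u):
    """Hypothetical bracketing entropy at radius u^2, chosen as 1/u for illustration."""
    return 1.0 / u


delta = 0.3
phi = 2.0 * math.sqrt(delta)  # integral of u^(-1/2) over (0, delta)
chain_constant = 2 * (math.sqrt(5) + 1 / 3)
print(chain_constant, dyadic_sum(delta, toy_entropy), 2 * phi)
```

Combining the factor 6 with the factor 2 from the comparison between the dyadic sum and the integral gives the constant 12 of the statement.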

A.3. A maximal inequality for weighted processes. The following inequality
is more or less classical and well known. We present a (short) proof
for the sake of completeness. Note that in the statement and the proof of
Lemma A.5 below we use the convention that $\sup_{t\in A} g(t) = 0$ whenever $A$ is
the empty set.



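Lemma A.5. Let $S$ be a countable set, $u \in S$ and $a: S \to \mathbb{R}_+$ such that
$a(u) = \inf_{t\in S} a(t)$. Let $Z$ be a process indexed by $S$ and assume that the
nonnegative random variable $\sup_{t\in B(\varepsilon)}[Z(u) - Z(t)]$ has finite expectation for
any positive number $\varepsilon$, where $B(\varepsilon) = \{t \in S, a(t) \leq \varepsilon\}$. Let $\psi$ be a nonnegative
function on $\mathbb{R}_+$ such that $\psi(x)/x$ is nonincreasing on $\mathbb{R}_+$ and satisfies for
some positive number $\varepsilon_*$

$$\mathbb{E}\Bigl[\sup_{t\in B(\varepsilon)}[Z(u) - Z(t)]\Bigr] \leq \psi(\varepsilon) \quad\text{for any } \varepsilon \geq \varepsilon_*.$$

Then, one has, for any positive number $x \geq \varepsilon_*$,

$$\mathbb{E}\Bigl[\sup_{t\in S}\frac{Z(u) - Z(t)}{a^2(t) + x^2}\Bigr] \leq 4x^{-2}\psi(x).$$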



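Proof. Let us introduce for any integer $j$

$$C_j = \{t \in S,\ r^jx < a(t) \leq r^{j+1}x\},$$

with $r > 1$ to be chosen later. Then $\{B(x), \{C_j\}_{j\geq 0}\}$ is a partition of $S$ and,
therefore,

$$\sup_{t\in S}\frac{Z(u) - Z(t)}{a^2(t) + x^2} \leq \sup_{t\in B(x)}\frac{(Z(u) - Z(t))_+}{a^2(t) + x^2} + \sum_{j\geq 0}\sup_{t\in C_j}\frac{(Z(u) - Z(t))_+}{a^2(t) + x^2},$$

which, in turn, implies that

(A.18)  $x^2\sup_{t\in S}\frac{Z(u) - Z(t)}{a^2(t) + x^2} \leq \sup_{t\in B(x)}(Z(u) - Z(t))_+ + \sum_{j\geq 0}(1 + r^{2j})^{-1}\sup_{t\in B(r^{j+1}x)}(Z(u) - Z(t))_+.$

Since $a(u) = \inf_{t\in S} a(t)$, one has $u \in B(r^kx)$ for every integer $k$ for which
$B(r^kx)$ is nonempty and, therefore,

$$\sup_{t\in B(r^kx)}(Z(u) - Z(t))_+ = \sup_{t\in B(r^kx)}(Z(u) - Z(t)).$$

Hence, taking the expectation in (A.18) yields

$$x^2\,\mathbb{E}\Bigl[\sup_{t\in S}\frac{Z(u) - Z(t)}{a^2(t) + x^2}\Bigr] \leq \psi(x) + \sum_{j\geq 0}(1 + r^{2j})^{-1}\psi(r^{j+1}x).$$

Now by our monotonicity assumption, $\psi(r^{j+1}x) \leq r^{j+1}\psi(x)$, and thus

$$x^2\,\mathbb{E}\Bigl[\sup_{t\in S}\frac{Z(u) - Z(t)}{a^2(t) + x^2}\Bigr] \leq \psi(x)\Bigl[1 + r\sum_{j\geq 0} r^j(1 + r^{2j})^{-1}\Bigr] \leq \psi(x)\Bigl[1 + r\Bigl(\frac{1}{2} + \sum_{j\geq 1} r^{-j}\Bigr)\Bigr] \leq \psi(x)\Bigl[1 + r\Bigl(\frac{1}{2} + \frac{1}{r-1}\Bigr)\Bigr]$$

and the result follows by choosing $r = 1 + \sqrt{2}$. □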
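The choice r = 1 + √2 can be checked numerically. The sketch below evaluates the series factor 1 + r Σ_{j≥0} r^j/(1 + r^{2j}) (truncated, which only lowers it) and its closed-form upper bound 1 + r(1/2 + 1/(r-1)), and compares both with the constant 4 of the statement. The function names and the truncation level are assumptions made for illustration.

```python
import math


def series_factor(r, terms=200):
    """1 + r * sum_{j>=0} r^j / (1 + r^(2j)), truncated after `terms` terms."""
    return 1 + r * sum(r ** j / (1 + r ** (2 * j)) for j in range(terms))


def closed_form_bound(r):
    """Upper bound 1 + r * (1/2 + 1/(r - 1)) on the series factor, for r > 1."""
    return 1 + r * (0.5 + 1 / (r - 1))


r_star = 1 + math.sqrt(2)
print(series_factor(r_star), closed_form_bound(r_star))
```

Differentiating the closed-form bound in r shows that r = 1 + √2 is its minimizer, with minimal value 5/2 + √2, which is slightly below 4.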